7/17/2011

07-17-11 - CLH list-based lock

The multi-way ticket lock we just did is very similar to some classic spin locks. I found this nice page : scalable synchronization pseudocode ( and parent page at cs.rochester ) ( and similar material covered here, but with nice drawings : Mutual Exclusion: Classical Algorithms for Locks (PDF) ).

I wanted to see how the classic MCS queue lock compares to my per-thread-mpsc lock ; the answer is not much. The classic queue locks are really closer kin to the multi-way ticket lock. I'll try to show that now. The MCS lock is probably better known, but the CLH lock is simpler, so I'll deal with that.

The idea of these locks is to avoid the heavy cache contention inherent to the basic single-variable gate locks. To solve that, the idea is to use a distributed gate; basically one gate variable per waiter, and it's the responsibility of the unlocker to open the gate for the next waiter. So there has to be some kind of linked list so that the unlocker can find the next waiter. And these locks will be inherently FIFO and SCHED_OTHER and all that. (these are really only appropriate for kernels or kernel-like environments)

The CLH algorithm is usually described as a linked list, with the "head" of the list being the node that currently has access to the mutex, and the "tail" being the variable held in the lock struct. When new waiters come in, they tack onto the tail, thus it's FIFO.

There's a node for each waiter, and each node contains the gate for the guy after me :


struct clh_node
{
    // list is linked from tail backwards :
    std::atomic<clh_node *> prev;
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait;
    
    clh_node()
    {
        prev($).store(0);
        succ_must_wait($).store(0);
    }
};

we also need a way of providing a node per-thread *per-lock* ; this is different from my event-queue-mutex that just needs a node *per-thread* ; the reason is that the nodes in CLH keep getting used even after you unlock, so you can't just reuse them. However, you can free some node when you unlock - just not necessarily the one you passed in. So anyhoo, we need some struct to pass in this node for us, here it is :

struct ThreadNode
{
    std::atomic<clh_node *> pNode;
    
    ThreadNode()
    {
        pNode($).store( new clh_node );
    }
    ~ThreadNode()
    {
        // note that the pNode I delete might not be the one I created
        //  so don't try to hold it by value
        clh_node * n = pNode($).exchange(0, std::mo_relaxed);
        delete n;
    }
};

this could be in the TLS, or it could be in the mutex::guard , or whatever.

Okay, now that we have our helpers we can write the code. When the mutex is held, the tail node will have succ_must_wait = 1 , when you take the lock you stick yourself on the tail and then wait on your predecessor. To unlock the mutex you just set succ_must_wait = 0 on yourself, and that allows the guy after you to go :


struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    std::atomic<clh_node *> m_tail;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_tail($).store( new clh_node );
    }
    ~clh_mutex()
    {
        clh_node * n = m_tail($).exchange(0);
        delete n;
    }
    
    void lock(ThreadNode * I)
    {
        clh_node * me = I->pNode($).load(std::mo_relaxed);
    
        me->succ_must_wait($).store( 1, std::mo_relaxed );
        //me->prev($).store(0, std::mo_relaxed );
        clh_node * pred = m_tail($).exchange(me, std::mo_acq_rel);
        me->prev($).store(pred, std::mo_relaxed );
        
        // wait on predecessor's flag -
        //  this is why pred can't free himself
        rl::linear_backoff bo;
        while ( pred->succ_must_wait($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
    }
    
    void unlock(ThreadNode * I)
    {
        clh_node * me = I->pNode($).load(std::mo_relaxed);
        
        clh_node * pred = me->prev($).load(std::mo_relaxed);
        me->succ_must_wait($).store( 0, std::mo_release );
        // take pred's node :
        //  this leaves my node allocated, since succ is still looking at it
        I->pNode($).store( pred, std::mo_relaxed );
    }

};

okay, I think this is reasonably self-explanatory. BTW the reason why the classical locks are the way they are is often to avoid test-and-set ops, which old machines either didn't have or implemented very expensively; here we use only one exchange, the rest is just loads and stores.
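To make the algorithm concrete outside of Relacy (the `($)` is Relacy instrumentation), here's a sketch in plain C++11 atomics; the caller does the ThreadNode job of adopting the predecessor's node, and a yield stands in for rl::linear_backoff. Names and structure follow the code above, but this is a sketch, not a tuned implementation :

```cpp
#include <atomic>
#include <thread>

struct clh_node
{
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait{0};
};

struct clh_mutex
{
    // tail of the waiter list :
    std::atomic<clh_node *> m_tail;

    clh_mutex() : m_tail(new clh_node) { }   // initial dummy node, gate open
    ~clh_mutex() { delete m_tail.load(); }   // the last node left in the queue

    // returns my predecessor's node; after unlock() the caller owns it
    clh_node * lock(clh_node * me)
    {
        me->succ_must_wait.store(1, std::memory_order_relaxed);
        clh_node * pred = m_tail.exchange(me, std::memory_order_acq_rel);
        // wait on predecessor's flag :
        while ( pred->succ_must_wait.load(std::memory_order_acquire) )
            std::this_thread::yield();
        return pred;
    }

    void unlock(clh_node * me)
    {
        // open the gate for my successor; "me" stays in the queue
        me->succ_must_wait.store(0, std::memory_order_release);
    }
};
```

A thread uses it like : `pred = mtx.lock(node); ... mtx.unlock(node); node = pred;` - after unlocking you adopt the predecessor's node as your own, which is exactly the recycling that ThreadNode does.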

That matches the classical algorithm description, but it's a lot more expensive than necessary. The first thing you might notice is that we don't actually need to store the linked list at all. All we need to do is get "pred" from lock to unlock. So you can either store it in the mutex struct, or put it in the "guard" (ThreadNode in this case) ; I think putting it in the guard is better, but I'm going to put it in the mutex right now because it's more analogous to our next step :


struct clh_node
{
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait;
    
    clh_node() { succ_must_wait($).store(0); }
};

struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    std::atomic<clh_node *> m_lock;
    std::atomic<clh_node *> m_lock_pred;
    std::atomic<clh_node *> m_lock_holder;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_lock($).store( new clh_node );
        m_lock_pred($).store( 0 );
        m_lock_holder($).store( 0 );
    }
    ~clh_mutex()
    {
        clh_node * n = m_lock($).exchange(0);
        delete n;
    }

    clh_node * alloc_slot()
    {
        return new clh_node;
    }
    void free_slot(clh_node * p)
    {
        delete p;
    }
    
    void lock()
    {
        clh_node * me = alloc_slot();
    
        me->succ_must_wait($).store( 1, std::mo_relaxed );
        clh_node * pred = m_lock($).exchange(me, std::mo_acq_rel);
        
        rl::linear_backoff bo;
        while ( pred->succ_must_wait($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
        
        m_lock_holder($).store(me, std::mo_relaxed );
        m_lock_pred($).store(pred, std::mo_relaxed );
    }
    
    void unlock()
    {
        clh_node * me = m_lock_holder($).load(std::mo_relaxed);
        
        clh_node * pred = m_lock_pred($).load(std::mo_relaxed);
        
        me->succ_must_wait($).store( 0, std::mo_release );

        free_slot( pred );
    }

};

and rather than pass in the nodes I just bit the bullet and allocated them. But now the obvious thing to do is make alloc_slot and free_slot just take & return nodes from an array. But then "me" is just stepping a pointer through an array. So our "linked list" should just be a sequence of adjacent elements in an array :

struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    #define NUM_WAYS    16
    // should be cache line sized objects :
    std::atomic<int> succ_must_wait[NUM_WAYS];
    std::atomic<int> m_lock;
    VAR_T(int) m_lock_pred;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_lock($).store(0);
        succ_must_wait[0]($).store(0);
        for(int i=1;i<NUM_WAYS;i++)
        {
            succ_must_wait[i]($).store(1);
        }
        m_lock_pred($) = 0;
    }
    ~clh_mutex()
    {
    }

    void lock()
    {   
        int pred = m_lock($).fetch_add(1, std::mo_acq_rel);
        pred &= (NUM_WAYS-1);
        
        rl::linear_backoff bo;
        while ( succ_must_wait[pred]($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
        
        // m_lock_pred just remembers my index until unlock
        //  could be a local
        m_lock_pred($) = pred;
    }
    
    void unlock()
    {
        int pred = m_lock_pred($);
        int me = (pred+1)&(NUM_WAYS-1);
        
        // recycle this slot :
        succ_must_wait[pred]($).store(1, std::mo_relaxed);
        
        // free my lock :
        succ_must_wait[me]($).store( 0, std::mo_release );
    }

};

(as usual, m_lock_pred doesn't really belong as a member variable in the lock).

But this is exactly "Anderson's array-based queue lock" that we mentioned at the end of the ticket-lock post, and it's also just a CLH lock with the nodes stuck in an array. This suffers from the big problem that you must have at least as many array entries as threads that will touch the lock, or it doesn't work (what happens is multiple threads can get into the mutex at the same time, i.e. it doesn't actually provide mutual exclusion).

I don't think this is actually useful for anything, but there you go.

07-17-11 - Atman's Multi-way Ticket Lock

But first - Atman sent me a note about the ticket lock which is related and worth sharing.

Imagine we're in a "high performance computing" type environment, we have a mess of threads running, locked on the processor, which is the appropriate time to use something like a ticket lock. Now, they all try to get the lock. What actually happens?

N threads run into the mutex and 1 gets the lock. So the remaining (N-1) go into their spin loops :


while ( m_serving($).load(std::mo_acquire) != my_ticket ) ;

but what is this actually doing? When you try to load that shared variable, what it does is something like :

    is this cache line current?
    okay, read the variable from my copy of the cache line

now the first guy holding the mutex unlocks it. This marks the cache line dirty and that is propagated to all the other cores. Now the remaining (N-1) spinners try to read m_serving again. But this time it does :

    is this cache line current?
    no it's not, get the new copy of the cache line
    okay, read the variable from my copy of the cache line

the cache line had to be copied around (N-1) times. You can see the pattern : to do N unlocks with N waiters you wind up doing on the order of N^2 cache line copies. Obviously this is not okay for N large.

(note that this is why putting a backoff pause in your spin loop can actually be a performance advantage even on non-hyperthreaded cores - it reduces cache line traffic ; also in the special case of the ticket lock, the waiters actually know how far they are from the front of the list, so they can do "proportional backoff" by subtracting "my_ticket" from "now serving" and pausing for that amount of time).
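The proportional backoff idea is a tiny change to the ticket lock's spin loop. A sketch in plain C++11 (assumption: one microsecond per queue position is an arbitrary stand-in for a pause you'd tune on real hardware) :

```cpp
#include <atomic>
#include <chrono>
#include <thread>

struct ticket_mutex_pb
{
    std::atomic<unsigned int> m_ticket{0};
    std::atomic<unsigned int> m_serving{0};

    void lock()
    {
        unsigned int me = m_ticket.fetch_add(1, std::memory_order_acq_rel);
        for(;;)
        {
            unsigned int cur = m_serving.load(std::memory_order_acquire);
            if ( cur == me )
                return;
            // I know exactly how many lockers are ahead of me,
            // so pause proportionally instead of hammering the line :
            unsigned int dist = me - cur;
            std::this_thread::sleep_for( std::chrono::microseconds(dist) );
        }
    }

    void unlock()
    {
        // m_serving is protected by the lock, so load+store is enough :
        unsigned int cur = m_serving.load(std::memory_order_relaxed);
        m_serving.store(cur + 1, std::memory_order_release);
    }
};
```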

Okay, but we can do better. Leave m_ticket as a single gate variable, but split m_serving into many "ways" (ways in the cache sense). So depending on your ticket # you look at a different serving number. This is just like a very large bakery - rather than have a single "now serving" sign, we have one for odd tickets and one for even tickets ; you stand on the side of the room with the appropriate sign for you and just read that sign.

The code is :


struct ticket_mutex_ways
{
    enum { NUM_WAYS = 16 };
    std::atomic<unsigned int> m_ticket;
    // note : m_serving objects should be cache-line-size padded :
    std::atomic<unsigned int> m_serving[NUM_WAYS];
    VAR_T(int)  m_lock_holder;
    
    ticket_mutex_ways()
    {
        m_ticket($).store( 0 );
        for(int i=0;i<NUM_WAYS;i++)
            m_serving[i]($).store( 0 );
    }
    ~ticket_mutex_ways()
    {
    }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel); // mo_acquire ; *1
    
        int way = me % NUM_WAYS;
    
        rl::linear_backoff bo;
        for(;;)
        {
            unsigned int cur = m_serving[way]($).load(std::mo_acquire);
            if ( me == cur )
                break;
        
            bo.yield($);
        }
        
        m_lock_holder($) = me;
    }

    void unlock()
    {
        int next = m_lock_holder($) + 1;
        int way = next % NUM_WAYS;
        
        m_serving[way]($).store(next,std::mo_release);
    }
};

the key thing is that in the spin loop you are only touching the serving variable in your way, and there is no cache contention with up to NUM_WAYS lockers. (as noted - you need cache line size padding between the variables)
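One way to get that padding in C++11 is alignas (assumption: 64-byte cache lines; on some chips you'd want 128) :

```cpp
#include <atomic>

// alignas pads sizeof up to a multiple of the alignment, so adjacent
// array elements land on different cache lines :
struct alignas(64) padded_serving
{
    std::atomic<unsigned int> v{0};
};

static_assert( sizeof(padded_serving) == 64, "one full line each" );

// m_serving would then be declared as : padded_serving m_serving[NUM_WAYS];
```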

(*1 standard caveat here - this only needs to be acquire, but then you need other mechanisms to protect your mutex, so beware)

Note that "m_lock_holder" really doesn't belong in the mutex structure; it's an annoyance that I have to put it there; it should just be held on the stack until the unlock. If you use some kind of "guard" class to wrap your mutex lock/unlock it would be more appropriate to store this in the guard. (in fact it is probably good class design to make lock & unlock take the "guard" as a parameter, because that allows more flexibility).

This is pretty cool, I think it's about as fast as you can get if you don't care about your mutex being rather large. One nice thing about it is that you don't need to know your number of CPUs. There are a lot of similar algorithms that break unless NUM_WAYS is >= the number of threads. (for example, you can do basically the same thing but just use a bool in each way to indicate locked or not, and that works fine as long as num threads is < num ways; that would be "Anderson's array-based queue lock" BTW). With Atman's algorithm, you can even choose to make NUM_WAYS smaller, and you will be fast as long as the number of contenders is less than the number of ways.

7/16/2011

07-16-11 - Ticket FIFO Mutex

The Linux kernel internally uses a FIFO spinlock that they call "ticket lock". A ticket or "bakery" algorithm is quite a common pattern so we'll have a glance.

The analogy is the easiest way to understand it. There's an atomic ticket machine, when you walk into the shop you grab a ticket (and the machine increments itself). On the wall is a "now serving" sign that counts up as people turn in their tickets.

This can be implemented most obviously using two ints :


struct ticket_mutex2
{
    // (*0)
    std::atomic<unsigned int> m_ticket;
    std::atomic<unsigned int> m_serving;

    ticket_mutex2()
    {
        m_ticket($).store( 0 );
        m_serving($).store( 0 );
    }
    ~ticket_mutex2()
    {
    }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel);
    
        rl::linear_backoff bo;
        for(;;)
        {
            unsigned int cur = m_serving($).load(std::mo_acquire);
            if ( me == cur )
                return;
        
            bo.yield($);
            
            // (*1)
        }
    }

    void unlock()
    {
        // (*2)
        // my ticket must match m_serving
        // (*3)
        //m_serving($).fetch_add(1,std::mo_release);
        unsigned int cur = m_serving($).load(std::mo_relaxed);
        m_serving($).store(cur+1,std::mo_release);
    }
};

*0 : obviously you could put the two counters into words and mush them into one int (Linux on x86 used to put them into bytes and mush them into one word), but it's actually a better demonstration of the algorithm to have them separate, because it's a weaker constraint. Lockfree algorithms always continue to work if you mush together variables into larger atomic pieces, but rarely continue to work if you separate them into smaller independent atomic pieces. So when you're trying to show the fundamental requirements of an algorithm you should use the minimum mushing-together required.
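For reference, a sketch of the mushed-together layout (ticket in the low word, now-serving in the high word - the same layout the keyed-event version below is forced into). Caution: this sketch ignores the low word wrapping and carrying into the high word after 2^16 locks, which a real packed implementation has to deal with :

```cpp
#include <atomic>
#include <thread>

struct ticket_mutex_packed
{
    // low 16 bits = next ticket, high 16 bits = now serving
    std::atomic<unsigned int> m_state{0};

    void lock()
    {
        // grab a ticket (after 2^16 locks this carries into the
        // serving word - a real version must mask or add bytewise) :
        unsigned int me = m_state.fetch_add(1, std::memory_order_acq_rel) & 0xFFFF;
        while ( ( m_state.load(std::memory_order_acquire) >> 16 ) != me )
            std::this_thread::yield();
    }

    void unlock()
    {
        // serving now shares the atomic with the ticket, so the
        // increment must be an atomic RMW, not a load+store :
        m_state.fetch_add( 1u << 16, std::memory_order_release );
    }
};
```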

(BTW I don't remotely claim that any of the things I've posted have the minimum synchronization constraints required by the algorithm, but that is always the goal).

*1 : you might be tempted to put a Wait here using eventcount or something, but you can't. The problem is if multiple threads go to sleep there, only the one thread that has the next ticket will be able to take the lock. So if you use a generic waitset, you might wake the wrong thread, it won't be able to get in, and you will deadlock. More on this in a moment.

*2 : m_serving is actually protected by the mutex, it is only ever modified by the mutex holder. m_ticket is actually the barrier variable for acquiring the lock. When you get the lock you could store your ticket id as a member in the lock struct and at unlock it will be equal to m_serving.

*3 : you can of course use an atomic increment on serving but because of *2 it's not necessary, and a simple load & inc is cheaper on some architectures (and as per *0, it's a weaker constraint so we prefer to demonstrate its correctness here).

Okay, this is a very cheap lock in terms of the number of expensive atomics required, and it's FIFO (fair) which is nice in some cases, but it simply cannot be used outside of a kernel environment. The reason is that if the thread who is next in line gets swapped out, then no currently running threads can get the lock, and we don't have any wakeup mechanism to get that sleeping thread to take the lock so we can make progress. This is okay in the kernel because the kernel is controlling which threads are awake or asleep, so obviously it won't put a thread to sleep that is currently spinning trying to get the lock.

So if we want to turn this into a FIFO lock that works in user space, we have to have a sleep/wakeup mechanism.

I don't think this is actually an awesome way to write your own FIFO lock, but it's a nice demonstration of the usefulness of NT's Keyed Events, so I'm going to do that.

You need to get the secret functions :


template <typename ret,typename T1,typename T2,typename T3,typename T4>
ret CallNtInternal(const char * funcName,T1 arg1,T2 arg2,T3 arg3,T4 arg4)
{
    typedef ret NTAPI t_func(T1,T2,T3,T4);

    t_func * pFunc = (t_func*) GetProcAddress( LoadLibrary( TEXT("ntdll.dll") ), funcName );
    ASSERT_RELEASE( pFunc != NULL );

    return (*pFunc) (arg1,arg2,arg3,arg4);
}

#define MAKE_NTCALL_4(ret,func,type1,type2,type3,type4) ret func(type1 arg1,type2 arg2,type3 arg3,type4 arg4) { return CallNtInternal<ret>(#func,arg1,arg2,arg3,arg4); }

MAKE_NTCALL_4( LONG,NtCreateKeyedEvent,OUT PHANDLE, IN ACCESS_MASK, IN PVOID, IN ULONG );
MAKE_NTCALL_4( LONG,NtReleaseKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER ); 
MAKE_NTCALL_4( LONG,NtWaitForKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER );

and then you can make the lock :


struct ticket_mutex2_keyed
{
    std::atomic<unsigned int> m_state;
    // ticket is bottom word
    // now serving is top word

    HANDLE  m_keyedEvent;

    // keyed event must have bottom bit off :
    enum { WAITKEY_SHIFT = 1 };

    ticket_mutex2_keyed()
    {
        m_state($).store( 0 );
        NtCreateKeyedEvent(&m_keyedEvent,EVENT_ALL_ACCESS,NULL,0);
    }
    ~ticket_mutex2_keyed()
    {
        CloseHandle(m_keyedEvent);
    }

    void lock()
    {
        // grab a ticket and inc :
        unsigned int prev = fetch_add_low_word(m_state,1);
    
        // if ticket matches now serving I have the lock :
        if ( top_word_matches_bottom(prev) )
            return;
    
        // wait on my ticket :
        unsigned int ticket = prev&0xFFFF;
        intptr_t waitKey = (ticket<<WAITKEY_SHIFT);
        NtWaitForKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
    }

    void unlock()
    {
        // inc now serving :
        unsigned int prev = m_state($).fetch_add((1<<16),std::mo_release);

        // get a local copy of the "now serving" that I published :
        prev += (1<<16);

        // if lock was not made open to new entries :       
        if ( ! top_word_matches_bottom(prev) )
        {
            // wake up the one after me in the sequence :
            unsigned int next = (prev>>16);
            intptr_t waitKey = (next<<WAITKEY_SHIFT);
            NtReleaseKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
        }
    }
};

Note that we have had to push together our two state variables now, because previously unlock only touched the "now serving" counter, but now it also has to check against the ticket counter to see if there are any people waiting.

Also note that we are taking advantage of the fact that ReleaseKeyedEvent is blocking. If the Release happens before the Wait, the signal is not lost - the unlocking thread blocks until the Wait is entered.

Exercise for the reader : make it possible for lock to spin a while before going into the wait.

I made use of these self-explanatory helpers :


bool top_word_matches_bottom( unsigned int x )
{
    unsigned int t = _lrotl(x,16);
    return t == x;
}

unsigned int fetch_add_low_word(std::atomic<unsigned int> & state,int inc)
{
    unsigned int old = state($).load(std::mo_relaxed);
    while ( ! state($).compare_exchange_weak(old,((old+inc)&0xFFFF) | (old & 0xFFFF0000),std::mo_acq_rel) ) { }
    return old;
}

which do what they do.
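As a portability note, _lrotl is MSVC-specific; a shift pair gives the same 16-bit rotate. Here are the same helpers in plain C++11 (without the Relacy `($)`), plus a quick sanity check that the low-word increment wraps without carrying into the top word :

```cpp
#include <atomic>

// true iff the high 16 bits equal the low 16 bits :
bool top_word_matches_bottom( unsigned int x )
{
    unsigned int t = ( x << 16 ) | ( x >> 16 ); // portable 16-bit rotate
    return t == x;
}

// atomically add to the low 16 bits only, leaving the top word alone :
unsigned int fetch_add_low_word( std::atomic<unsigned int> & state, unsigned int inc )
{
    unsigned int old = state.load(std::memory_order_relaxed);
    while ( ! state.compare_exchange_weak( old,
                ((old + inc) & 0xFFFFu) | (old & 0xFFFF0000u),
                std::memory_order_acq_rel ) )
    { }
    return old;
}
```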

Obviously on Linux you could use futex, but there are too many platforms that have neither KeyedEvent nor futex, which makes using them not very attractive.

Some links :

Time-Published Queue-Based Spin Locks
Ticket spinlocks [LWN.net]
spinlocks XXXKSE What to do
Linux x86 ticket spinlock
git.kernel.org - linux/kernel/git/torvalds/linux-2.6.git commit
futex(2) - Linux manual page

7/15/2011

07-15-11 - Review of many Mutex implementations

This is gonna be a long one. The point of this is not that you should go off and implement your own mutex (don't). The point is that it's educational to understand this simple case, because the issues will be the same in other domain-specific threading problems. A lot of the time, people think they are being "safe" by using the OS mutex, but then they still do some atomic CAS on a bool and think "it's no big deal, it's just a CAS and loop" and are basically creating all the same issues of races and livelocks and thrashing without being careful about it.

So, I'm going to present a (hopefully) working implementation of a mutex/lock and then talk about the issues with that type of implementation.

classic single-variable CAS spinlock :


class spinmutex
{
public:

    spinmutex() : m_lock(0)
    {
    }
    
    void lock()
    {
        rl::linear_backoff b;
        unsigned int prev = 0;
        // (*1)
        while ( ! m_lock($).compare_exchange_weak(prev,1,std::mo_acquire) )
        //while ( ! m_lock($).compare_exchange_weak(prev,1,std::mo_acq_rel) )
        {
            b.yield($);
            prev = 0;
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock($).load(std::mo_relaxed) == 1 );
        m_lock($).store(0,std::mo_release);
    }

private:
    std::atomic<unsigned int> m_lock;
};

*1 : I believe the CAS only needs to be acquire, but then you do need some other mechanism to keep stores from moving out the top of the mutex (such as a #loadstore which C++0x doesn't provide), and some way to prevent mutexes from moving to overlap each other (which could lead to deadlock). So it's probably easiest to just make the CAS be acq_rel even though it doesn't need to be. (see previous post on the barriers that we think a mutex needs to provide). Most of the mutexes here have this issue and we won't mention it again.

For some reason people love to implement the basic spinlock with CAS, but in fact you can do it just with exchange :


class spinmutex2
{
public:

    spinmutex2() : m_lock(0)
    {
    }
    
    void lock()
    {
        rl::linear_backoff b;
        while ( m_lock($).exchange(1,std::mo_acquire) )
        {
            b.yield($);
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock($).load(std::mo_relaxed) == 1 );
        m_lock($).store(0,std::mo_release);
    }

private:
    std::atomic<unsigned int> m_lock;
};

which is cheaper on some platforms.
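Outside of Relacy the exchange lock is just a few lines of standard C++11 (with a yield standing in for the backoff) :

```cpp
#include <atomic>
#include <thread>

class spinmutex2_std
{
public:
    void lock()
    {
        // exchange returns the previous value; 0 means we got the lock
        while ( m_lock.exchange(1, std::memory_order_acquire) )
            std::this_thread::yield();
    }
    void unlock()
    {
        m_lock.store(0, std::memory_order_release);
    }
private:
    std::atomic<unsigned int> m_lock{0};
};
```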

So, there are a few problems with spinmutex. The most obvious is that you have to just spin, the threads which can't get in don't go to sleep. The other problem is that it doesn't respect OS scheduling directives (thread priorities) and it's quite un-"fair", in that it doesn't order access at all, and in fact greatly favors the last thread in, since it's most likely to be getting CPU time.

So we want to make it sleep. The pattern for making lock-free primitive sleep is to change :


while ( ! trylock() ) { spin }

to :

if ( ! trylock() )
{
  register desire to wait

  if ( trylock() ) return (cancel wait)

  wait;
}

that is, a double-checked wait. (and then perhaps loop, depending on whether waking from wait implies the condition). The reason you need to do this is that before the "register waiter" has finished putting you in the wait list, the lock may become open. If you didn't try to acquire the lock again, you would miss the wake signal.

So the easiest way to transform our spin mutex into one that sleeps is with eventcount :

eventcount sleeping exchange lock :


class ecmutex1
{
public:

    ecmutex1() : m_lock(0)
    {
    }
    
    void lock()
    {
        while ( m_lock.exchange(1,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            // double check :
            if ( m_lock.exchange(1,std::memory_order_acquire) == 0 )
                return; // got the lock
            
            // wait for signal :
            m_ec.wait(ec_key);
            // now retry
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock.load(std::memory_order_relaxed) == 1 );
        m_lock.store(0,std::memory_order_release);
        m_ec.signal();
    }

private:
    //std::atomic<unsigned int> m_lock;
    rl::atomic<unsigned int> m_lock;
    eventcount m_ec;
};

now, this is okay, but there are a few problems.

One is that the signal for the eventcount is just a "hey wake up and see if you can get the lock" , it's not a "hey wake up you have the lock". That means it can suffer from what I call "thrashing" or spurious wakeup (this is not technically "spurious wakeup" , a true "spurious wakeup" would be a wakeup that didn't come from calling signal()). You might wake a thread, it fails to get the lock, and goes right back to sleep. So that sort of sucks.

This issue is closely related to a fairness problem; we might be able to ensure some level of fairness through eventcount, but that is ruined by the fact that spinning threads can jump in and grab the lock before the one we signalled.

Another issue is that we are calling "signal" every time even when there is no waiter. This is a minor issue because your eventcount probably checks for waiters and does nothing (if it's a good implementation - a bad implementation might implement signal by immediately taking a mutex, in which case you really want to avoid calling it if you have no waiters).

Anyway, Thomasson showed how to improve this last little bit of inefficiency. You use one bit to flag locking and one bit to flag waiting, and you only need to signal the eventcount if the waiting bit is on :


class ecmutex2
{
public:

    enum { UNLOCKED = 0, LOCKED = 1, LOCKED_WAITING = (LOCKED|2) };

    ecmutex2() : m_lock(UNLOCKED)
    {
    }
    
    void lock()
    {
        unsigned int prev = 0;
        // this CAS could be a bit-test-and-set :
        while ( ! m_lock.compare_exchange_strong(prev,LOCKED,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            // double check :
            // change LOCKED->LOCKED_WAITING (and then we will wait)
            // or change UNLOCKED->LOCKED_WAITING (and we take the lock)
            prev = m_lock.exchange(LOCKED_WAITING,std::memory_order_acquire);
            if ( prev == UNLOCKED )
                return;
                
            m_ec.wait(ec_key);
            
            // now retry
            prev = 0;
        }
    }
    
    void unlock()
    {
        unsigned int local = m_lock.load(std::memory_order_relaxed);
        RL_ASSERT( local & LOCKED );
        m_lock.store(UNLOCKED,std::memory_order_release);
        // could always signal :
        //m_ec.signal();
        // faster because it avoids an atomic :
        unsigned int check = m_lock.load(std::memory_order_relaxed);
        if ( (local|check) & LOCKED_WAITING )
        {
            m_ec.signal();
        }
    }

private:
    rl::atomic<unsigned int> m_lock;
    eventcount m_ec;
    
};

you have to use a CAS (not an exchange) to take the lock initially, because you can't turn off the WAITING bit. The entire advantage of this method is the fact that in the uncontended case (no waiters), unlock only does a load_relaxed instead of the atomic op needed in eventcount to check if signal is necessary.

Note : in some cases it may be an improvement to spin a bit before going to sleep in the lock() side. It also can be an optimization to spin a bit before signalling in the unlock (to see if the WAITING flag turns off) - however, both of these hurt fairness, they make the mutex more LIFO than FIFO, which can indeed be an optimization in many cases, but is also dangerous (more notes on this issue elsewhere). If a thread was already asleep on the mutex, it will tend to stay asleep forever if there are other awake threads that keep trading the mutex around.

Anyhoo, you can implement the exact same thing using windows Event instead of eventcount :

Three-state mutex using Event :


// Thomasson's simple mutex based on windows event :
struct win_event_mutex
{
    std::atomic<int> m_state; // = 0
    HANDLE m_waitset; // auto reset event; set to false

    win_event_mutex()
    {
        m_state($) = 0;
        m_waitset = CreateEvent(NULL,0,0,NULL);
    }
    ~win_event_mutex()
    {
        CloseHandle(m_waitset);
    }

    void lock()
    {
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            while ( m_state($).exchange(2,rl::mo_acquire) )
            {
                WaitForSingleObject(m_waitset, INFINITE);
            }
        }
    }

    void unlock()
    {
        if ( m_state($).exchange(0,rl::mo_release) == 2 )
        {
            SetEvent(m_waitset);
        }
    }
};

the three states are again "0 = unlocked", "1 = locked (exclusive)" , "2 = contended (locked with waiter)".

(I got this from Thomasson but I believe it's actually an old algorithm; I've seen it discussed in many blogs. there is a slightly subtle state transition where m_state can be 2 (contended) and then someone comes in to lock() and exchanges it to 1 (locked, uncontended); that seems to be bad, because there is a Waiter which might now miss a signal (because we turned off the contended flag), but in fact it's okay because if that happens we will then step in and take the lock in the contended state (by exchanging in 2) and when we unlock we will signal the waiter. So this is another way of doing "unfair" acquisition (the later-entering thread gets the lock even though there was already a waiter) but it is not a lost wakeup).

The unlock is slightly more expensive because it's an exchange instead of just a store. This mutex is "fair" (as fair as Win32 native primitives ever are) for waiters that actually get into the waitset, because all the threads wait on the same Event and thus will get the OS priorities and boosts and so on. But it still doesn't hand off the lock in the wakeup and so on. (I guess I'll call respecting the OS scheduler "pseudo-fair" ; if the mutex implementation is at least as fair as the OS mutex)

BTW this mutex is very similar to the futex-based mutex that Bartosz described here ; in general anywhere you see Win32 event you could use futex on Linux, since futex is a superset of event.

We're going to take a diversion now and look at some other topics in mutex design - in particular, avoiding allocation of OS events unless/until they're actually needed (Win32 CRITICAL_SECTION does this, for example).

Event mutex that makes event on demand :


// Thomasson's simple mutex based on windows event :
// version that does event creation on demand
struct win_event_mutex_ondemand
{
    std::atomic<int> m_state; // = 0
    std::atomic<HANDLE> m_waitset; // auto reset event; set to false

    win_event_mutex_ondemand()
    {
        m_state($) = 0;
        m_waitset($) = 0;
    }
    ~win_event_mutex_ondemand()
    {
        if ( m_waitset($) != 0 )
            CloseHandle(m_waitset($));
    }

    void lock()
    {
        if ( m_state($).exchange(1,std::mo_acq_rel) )
        {
            HANDLE h = m_waitset($).load(std::mo_acquire);
            if ( h == 0 )
            {
                HANDLE newH = CreateEvent(NULL,0,0,NULL);
                if ( m_waitset($).compare_exchange_strong(h,newH,std::mo_acq_rel) )
                {
                     h = newH;
                }
                else
                {
                    // loaded h
                    RL_ASSERT( h != 0 );
                    CloseHandle(newH);
                }
            }
            RL_ASSERT( h != 0 );
            while ( m_state($).exchange(2,std::mo_acq_rel) )
            {
                WaitForSingleObject(h, INFINITE);
            }
        }
    }

    void unlock()
    {
        if ( m_state($).exchange(0,std::mo_acq_rel) == 2 )
        {
            HANDLE h = m_waitset($).load(std::mo_relaxed);
            RL_ASSERT(h != 0 );
            SetEvent(h);
        }
    }
};

This is just the same as the previous win_event_mutex , except that it makes the event on first use, using the modern speculative-creation singleton method.
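The speculative-creation pattern is worth isolating, since it's the core of the on-demand part. This is my generic sketch : racing threads may each create an object, one CAS installs a winner, and the losers destroy their speculative copies :

```cpp
#include <atomic>

// Speculative-creation singleton : first caller to win the CAS installs
// its object; everyone else deletes theirs and uses the winner's.
template <typename T>
T * get_or_create(std::atomic<T *> & slot)
{
    T * p = slot.load(std::memory_order_acquire);
    if ( p == 0 )
    {
        T * fresh = new T();
        if ( slot.compare_exchange_strong(p, fresh, std::memory_order_acq_rel) )
        {
            p = fresh; // we won the race; our object is the singleton
        }
        else
        {
            // p was reloaded by the failed CAS with the winner's object
            delete fresh;
        }
    }
    return p;
}
```

In win_event_mutex_ondemand, T is the Event handle and "delete" is CloseHandle, but the shape is the same.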

This works fine, unfortunately it is difficult to ever free the event, so once we make it we have it forever. If your goal is to do something like have 4k mutexes that only use 32 OS events, you can't do it this way. (in general you only need as many waitable handles as you have threads, and you might want to have many more lockable objects than that).

I implemented one way of making a mutex that releases its event when not needed, but it's a bit ugly :

Event mutex that only holds event during contention :


struct win_event_mutex2
{
    struct state // 64 bit double-word 
    {
        // two 32 bit words :
        int lock; HANDLE event; 

        state(int l,HANDLE h) : lock(l), event(h) { } 
        state() : lock(0), event(0) { } 
        bool operator == (const state & rhs) const { return lock == rhs.lock && event == rhs.event; }
    };
    
    std::atomic<state> m_state; // = 0

    win_event_mutex2()
    {
    }
    ~win_event_mutex2()
    {
        state local = m_state($);
        if ( local.event != 0 )
            CloseHandle(local.event);
    }

    void lock()
    {
        HANDLE newH = 0;
            
        state oldState = m_state($).load(std::mo_acquire);
        state newState;
        
        for(;;)
        {
            // increment the lock count :
            newState = oldState;
            newState.lock = oldState.lock+1;
            
            // if there is contention, make sure there is an event to wait on :
            if ( newState.lock > 1 && newState.event == 0 )
            {
                if ( newH == 0 )
                {
                    newH = CreateEvent(NULL,0,0,NULL);
                }
                
                newState.event = newH;              
            }
            else if ( newState.lock == 1 && newState.event == newH )
            {
                newState.event = 0;
            }

            // try to swap in the lock count and event handle at the same time :
            if ( m_state($).compare_exchange_weak(oldState,newState) )
                break;

        }
        
        if ( newH && newH != newState.event )
        {
            // I made an event but didn't use it
            CloseHandle(newH);
        }
                
        if ( oldState.lock == 0 )
        {
            // I got the lock
            RL_ASSERT( newState.lock == 1 );
            return;
        }
        
        // lock is contended :
        RL_ASSERT( newState.lock > 1 );
        RL_ASSERT( newState.event != 0 );

        WaitForSingleObject(newState.event, INFINITE);
        
        // I own the mutex now!
    }

    void unlock()
    {
        state oldState = m_state($).load(std::mo_acquire);
        RL_ASSERT( oldState.lock >= 1 );
        state newState(0,0);
        
        // at this moment I own the mutex
        
        for(;;)
        {
            // release the lock, and if we're no longer contended remove the event
            RL_ASSERT( oldState.lock >= 1 );
            newState = oldState;
            newState.lock--;
        
            if ( newState.lock == 0 && newState.event != 0 )
            {
                newState.event = 0;
            }
        
            if ( m_state($).compare_exchange_weak(oldState,newState) )
                break;
        }

        if ( oldState.event )
        {
            if ( newState.event == 0 )
            {
                RL_ASSERT( newState.lock == 0 );

                CloseHandle(oldState.event);
            }
            else
            {
                RL_ASSERT( newState.lock > 0 );
                SetEvent(oldState.event);
            }
        }
    }
};

This is always the cop-out method for implementing lock free algorithms - take the two variables that you need to stay in sync (in this case the lock count and the presence of a waitable event) - and just mush them together into a bigger word and CAS it atomically. That way you don't have to think carefully about all the funny state transition possibilities. I'm sure someone could do a better version of this that's not so expensive in atomic ops on the fast path (no contention).

(btw it's safer / more portable to make "state" just be a uint64 and do the packing manually with shifts and ors, don't use a struct inside std::atomic, it causes too many headaches and is too risky)
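A minimal sketch of that manual packing (field layout is illustrative : low 32 bits = lock count, high 32 bits = event id), so the atomic is just a std::atomic<uint64_t> :

```cpp
#include <atomic>
#include <cstdint>

typedef uint64_t state_t;

// pack/unpack {lock count, event id} into one CAS-able word :
inline state_t make_state(uint32_t lock_count, uint32_t event_id)
{
    return ((state_t)event_id << 32) | lock_count;
}
inline uint32_t state_lock_count(state_t s) { return (uint32_t)s; }
inline uint32_t state_event_id (state_t s) { return (uint32_t)(s >> 32); }

// example transition : bump the lock count, keeping the event id :
inline bool try_bump_lock(std::atomic<state_t> & st)
{
    state_t old = st.load(std::memory_order_acquire);
    state_t nw  = make_state(state_lock_count(old) + 1, state_event_id(old));
    return st.compare_exchange_strong(old, nw);
}
```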

(to be clear, the problem with this algorithm is the no-contention fast path is way more expensive than any of our previous mutexes).

Also note : you shouldn't actually use CreateEvent/CloseHandle with something like this, you should have a recycling pool of OS Events that you alloc/free from an event cache of some kind. As I said before, you only need one per thread. If you do use a pool like this, you have to be a bit careful about whether they can be recycled in signalled state, and whether you want to try to Reset them at some point (beware someone could still be in the process of waking up from it), or just make your algorithm work okay with a pre-signalled event, or something.

There is one way this mutex is better than any previous one - when a thread receives a wakeup, it is never bogus; it automatically has the lock when it is woken up. Also under contention there can be no "stolen locks" from threads that jump the queue and quickly grab the atomic - one of the sleeping threads will always be woken.

(btw "pseudo-fairness" like this is not always better; in the case that all your threads are equivalent workers, you actually want a LIFO mutex, because LIFO keeps caches hotter and makes thread switches less likely. However, some code can be stalled indefinitely by LIFO mutexes, so they are very dangerous in the general case.)

Okay, that's enough mutexes for one post, we'll do some followups in a later one.

7/14/2011

07-14-11 - Some obscure threading APIs

Futex :

Futex from a Win32 programmer's perspective is an enhanced version of the windows Event. It has absolutely nothing to do with a "fast user space mutex". It's a much lower level primitive - it's a "waitset" if you like (more about "waitset" in a future post). Basically it lets you put a thread to sleep with a Wait(), or wake up one or more threads. It has several advantages over Win32 Event which make it very nice.

Futex is associated with the address of an int. This means you don't have to actually create a Futex object, any int in your system can be used as the handle for a waitable event. This is nice.

Futex Wait atomically checks the int vs. a certain value. The basic futex Wait op is :

atomically {
  if ( *address == check_value ) Wait();
  else return;
}
this atomic check before waiting is exactly what you want for implementing lots of threading primitives (mutex, conditionvar, etc.) so that's very handy. The normal thing you would write is something like :

thread 1:

if ( queue empty / mutex locked / whatever )
{
    Wait();
}

thread 2:

push queue / unlock mutex / whatever
Signal();

but that contains a race. You can fix it very easily with futex by passing in the condition to check to the Wait, like :

thread 1 :
if ( num_items == 0 )
    futex_wait( &num_items, 0 );

thread 2 :
num_items++;
Signal();

(note that the alternative to this is a prepare_wait/wait pair with a double-check of the condition after the prepare_wait, and signal applies to anyone who has prepared, not anyone who has waited)

Futex Wake can wake up N threads (sadly Win32 event only provides "wake 1" in the robust auto-reset mode). That's nice.

Some reference :
Futexes are tricky (PDF) (no they're not, BTW)
Thin Lock vs. Futex - Bartosz Milewski's Programming Cafe
Mutexes and Condition Variables using Futexes
futex(2) - Linux manual page

Now some Win32 :

SignalObjectAndWait :

SignalObjectAndWait (Win2k+).

This seems pretty hot, because you would think it was atomic (signal and wait in one kernel op), but it is *NOT* :

Note that the "signal" and "wait" are not guaranteed to be performed as an atomic operation. Threads executing on other processors can observe the signaled state of the first object before the thread calling SignalObjectAndWait begins its wait on the second object.

which means it's useless. It's just the same as calling Signal() then Wait(). Maybe it's less likely that you get swapped out between the two calls, which would reduce thread thrashing and also reduce the number of system calls, but it does not help you with correctness issues (an actually atomic SignalAndWait would let you implement things differently and solve some lost-wakeup problems).


thread1 :
  Signal(event1);
  /* (!1) */
  Wait(event2);

thread 2:
  Wait(event1);
  Signal(event2); // (!2)
  Wait(event2); // ! stolen wakeup

So first of all, the separate Signal & Wait is a performance bug : if thread1 loses the CPU at (!1) (or immediately upon signalling), thread2 runs; then when execution transfers back to thread1 it immediately goes to sleep at Wait(event2). That's lame, and it's why you prefer to do these things atomically. But it's even worse in this case : at (!2) we intended that Signal to go to thread1 and wake him up, but he isn't in his Wait yet, so thread2's own Wait(event2) consumes the signal. Because SignalAndWait is not atomic we have a race. If it were atomic, code like this could actually be correct.

Windows NT Keyed Events (NtWaitForKeyedEvent/NtReleaseKeyedEvent) :

This is an NT internal API (it's in NTDLL since Win XP, so despite being internal-only it's actually more usable than the Win2k+ or Vista+ stuff (revision : duh, brain fart, not true, 2k+ is fine, only Vista+ stuff is problematic)). It's a very nice extension to the Win32 Event. Basically it lets you associate a value with the Wait and Signal.

NtWaitForKeyedEvent : analogous to Wait() on an Event, but also takes a value (the value is PVOID but it is just a value, it doesn't deref). Execution is stopped until a signal is received with that particular value.

NtReleaseKeyedEvent : analogous to SetEvent() (on an auto-reset event) - eg. it wakes exactly one thread - except that it actually blocks if no thread is waiting - that is, this always wakes exactly one thread (never zero, which SetEvent can do). Release also takes a value and only wakes threads waiting on that value.

So for example if you want you can use this to make a primitive implementation of futex. You have one global event (futex_event), then FutexWait(address) does WaitForKeyedEvent(futex_event,address) , using address as the value to wait for, and FutexWake(address) does ReleaseKeyedEvent(futex_event,address). (though obviously this is not a proper futex because it can't broadcast and can't check a value and so on).

More usefully, you can create a mutex which applies to an array! Something like :


template<typename T>
struct lockable_vector
{
  Event  vector_event;
  vector<T>   items;
  vector<int> contention;

  T * lock(int i)
  {
    if ( contention[i] ) NtWaitForKeyedEvent(vector_event, i);
    contention[i]++;
    return &items[i];
  }

  void unlock(int i)
  {
    contention[i]--;
    if ( contention[i] ) NtReleaseKeyedEvent(vector_event, i);
  }
};

(obviously this just is a rough sketch; you have to update "contention" properly atomically so that you only do the Wait and Release in the right cases) (I'll post a working version of this sometime soon).

(small issue with this : the lowest bit of the key must be unset ; apparently they use it as a flag bit, so you need a 2* or something to use it this way, or just give it aligned pointer addresses as the key)

The point is - you only have one kernel event, and you can mutex on any item in your vector; you can resize the vector and don't have to create/destroy mutexes. Cool! In the typical scenario that you have maybe 2-4 threads accessing an array of 4k items, this is a big win. (in fact this is something that's important in Oodle so I have a few implementations of how to do this without WaitForKeyedEvent).

KeyedEvent is implemented in the obvious way - when a thread in Win32 waits on a handle it gets stuck in that handle's data structure on a linked list. They just added a "wait_value" field to that linked list. So now when you do ReleaseKeyedEvent instead of just popping off the first thread in the list, it walks through the list in the kernel and tries to find a thread whose "wait_value" matches your signal value.

Some reference :
Slim Reader Writer Locks - Under The Hood - Matt Pietrek - Site Home - MSDN Blogs
NtCreateKeyedEvent
Keyed Events (lockless inc)
Concurrent programming on Windows - Google Books

07-14-11 - compare_exchange_strong vs compare_exchange_weak

The C++0x standard was revised a while ago to split compare_exchange (aka CAS) into two ops. A quick note on the difference :

bool compare_exchange_weak( T * ptr, T * old, T new );

bool compare_exchange_strong( T * ptr, T * old, T new );

(BTW in the standard "old" is actually a reference, which is a damn shitty thing to do because it makes it very non-transparent that "old" gets mutated by these functions, so I am showing it as a pointer).

both try to do :


atomically {
  if ( *ptr == *old ) { *ptr = new; return true; }
  else { *old = *ptr; return false; }
}

the difference is that compare_exchange_weak can also return false for spurious failure. (the original C++0x definition of CAS always allowed spurious failure; the new thing is the _strong version which doesn't).

If it returns due to spurious failure, then *old might be left untouched (and in fact, *ptr might be equal to *old but we failed anyway).
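In actual C++0x/C++11 terms the mutation of "old" looks like this (tiny demo; names are mine) :

```cpp
#include <atomic>

// The pseudocode above via the real std::atomic member function; note
// that "old" (here : expected) is passed by reference and is rewritten
// with the current value when the compare fails.
inline bool cas_strong(std::atomic<int> & x, int & expected, int desired)
{
    return x.compare_exchange_strong(expected, desired);
}
```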

If spurious failure can only occur due to contention, then you can still guarantee progress. In fact in the real world, I believe that LL-SC architectures cannot guarantee progress, because you can get spurious failure if there is contention anywhere on the cache line, and you need that contention to be specifically on your atomic variable to guarantee progress. (I guess if you are really worried about this, then you should ensure that atomic variables are padded so they get their own cache line, which is generally good practice for performance reasons anyway).
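That padding is a one-liner in C++11 (64 bytes is an assumption about the line size; it varies by CPU) :

```cpp
#include <atomic>

// One atomic per cache line, so unrelated traffic on neighboring data
// can't cause reservation loss (or false sharing).
struct alignas(64) padded_atomic_int
{
    std::atomic<int> value;
};

static_assert(alignof(padded_atomic_int) == 64, "one object per assumed cache line");
```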

On "cache line lock" type architectures like x86, there is no such thing as spurious failure. compare_exchange just maps to "cmpxchg" instruction and you always get the swap that you want. (it can still fail of course, if the value was not equal to the old value, but it will reload old). (BTW it's likely that x86 will move away from this in the future, because it's very expensive for very high core counts)

compare_exchange_weak exists for LL-SC (load linked/store conditional) type architectures (Power, ARM, basically everything except x86), because on them compare_exchange_strong must be implemented as a loop, while compare_exchange_weak can be non-looping. For example :

On ARM, compare_exchange_weak is something like :

compare_exchange_weak:

  ldrex     // load with reservation
  teq       // test equality
  strexeq   // store if equal
and strexeq can fail for two reasons - either because the values weren't equal, or because the reservation was lost (because someone else touched our cache line).

To implement compare_exchange_strong you need a loop :

compare_exchange_strong:

  while ( ! compare_exchange_weak(ptr,old,new) ) { }

(note that you might be tempted to put a (*old = *ptr) inside the loop, but that's probably not a good idea, and not necessary, because compare_exchange_weak will eventually load *ptr into *old itself when it doesn't fail spuriously).

The funny bit is that when you use compare_exchange you often loop anyway. For example say I want to use compare_exchange_strong to increment a value, I have to do :


cur = *ptr;
while( ! compare_exchange_strong(ptr,&cur,cur+1) ) { }

(note it's a little subtle that this works - when compare_exchange_strong fails, it's because somebody else touched *ptr, so we then reload cur (this is why "old" is passed by address), so you then recompute cur+1 from the new value; so with the compare_exchange_strong, cur has a different value each time around this loop.)

But on an LL-SC architecture like ARM this becomes a loop on a loop, which is dumb when you could get the same result with a single loop :


cur = *ptr;
while( ! compare_exchange_weak(ptr,&cur,cur+1) ) { }

Note that with this loop now cur does *not* always take a new value each time around the loop (it does when it fails due to contention, but not when it fails just due to reservation-lost), but the end result is the same.

So that's why compare_exchange_weak exists, but you might ask why compare_exchange_strong exists. If we always used loops like this, then there would be no need for it. But we don't always use loops like this, or we might want to loop at a much higher level. For example you might have something like :

bool SpinLock_TryLock(int * lock)
{
  int zero = 0;
  return compare_exchange_strong(lock,&zero,1);
}
which returns false if it couldn't get the lock (and then might do an OS wait) - you don't want to return false just because of a spurious failure. (that's not a great example, maybe I'll think of a better one later).

(BTW I think the C++0x stuff is a little bit broken, like most of C standardization, because they are trying to straddle this middle ground of exposing the efficient hardware-specific ways of doing things, but they don't actually expose enough to map directly to the hardware, and they also aren't high level enough to separate you from knowing about the hardware. For example none of their memory model actually maps directly to what x86 provides, therefore there are some very efficient x86-specific ways to do threading ops that cannot be expressed portably in C++0x. Similarly on LL-SC architectures, it would be preferable to just have access to LL-SC directly.

I'd rather see things in the standard like "if LL-SC exist on this architecture, then they can be invoked via __ll() and __sc()" ; more generally I wish C had more conditionals built into the language, that would be so much better for real portability, as opposed to the current mess where they pretend that the language is portable but it actually isn't so you have to create your own mess of meta-language through #defines).

07-14-11 - ARM Atomics

Some quick notes for reference in the future.

In general when I'm porting atomics to a new platform, what I would really like is a document from the hardware maker that describes their memory semantics and cache model in detail. Unfortunately that can be very hard to find. You also need to know how the compiler interacts with that memory model (eg. what does volatile mean, how do I generate compiler reorder barriers, etc). Again that can be very hard to find, particularly because so many compilers now are based on GCC and the GCC guys are generally stubborn punk-asses about clearly defining how they behave in the "undefined" parts of C which are so crucial to real code.

So assuming you can't find any decent documentation, the next place to look is some of the large cross-platform codebases. Probably the best one of all is Linux, because it's decently well tested. Unfortunately you can't just copy code from these since most of them are GPL, but you can use them as educational material to figure out the memory semantics of a platform. So some things we learn from Linux straight off :

1. Before ARMv6 you're fucked. There are no real atomic ops (SWP is not good enough) so you have to use some locking/critical-section mechanism to do atomics. Linux in the kernel does this by blocking interrupts, doing the atomic, then turning them back on. If you're not in the kernel, on Linux there's a secret syscall function pointer you can use, or non-Linux you have to use SWP to implement a spinlock which you then use to do CAS and such.

2. With ARMv6 you can use ldrex/strex , which seems to be your standard LL-SC kind of thing.

3. If you're SMP you need full memory barriers for memory ordering.

One thing I don't know is whether any of the Apple/Android consumer ARM multi-core chips are actually SMP ; eg. do they have separate caches, or are they shared single cache with multiple execution units?

Some reference I've found :

[pulseaudio-discuss] Atomic operations on ARM
Wandering Coder - ARM
QEmu - Commit - ViewGit
pulseaudio-discuss Atomic operations on ARM 1
Old Nabble - gcc - Dev - atomic accesses
Linux Kernel Locking Techniques
Linux Kernel ARM atomics
Debian -- Details of package libatomic-ops-dev in sid
Data alignment Straighten up and fly right
Broken ARM atomic ops wrt memory barriers (was [PATCH] Add cmpxchg support for ARMv6+ systems) - Patchwork
Außergewöhnlicher Migrationsdruck
Atomic - GCC Wiki
ARM Technical Resources
ARM Information Center

ARM RealView compiler has some interesting intrinsics that are not documented very well :


ARM RealView has :
    __force_stores
    __memory_changed
    __schedule_barrier
    __yield
    __strexeq

one trick in this kind of work is to find a compiler that has intrinsics you want and then just look at what assembly is generated so that you can see how to generate the op you want on that platform.

(but beware, because the intrinsics are not always correct; in particular the GCC __sync ops are not all right, sometimes have bugs, and sometimes their behavior is "correct" but doesn't match the documentation; it's very hard to find correct documentation on what memory semantics the GCC __sync ops actually guarantee).

Anyway, maybe I'll update this when I get some more information / do some more research.

7/13/2011

07-13-11 - Good threading design for games

I want to back up and talk a bit about how threading should be used in modern games and what makes good or bad threading design in general.

Games are sort of a weird intermediate zone. We aren't quite what would be considered "real-time" apps (a true "real-time" app sets itself to the maximum possible priority and basically never allows any thread switches), nor are we quite "high performance computing" (which emphasizes throughput over latency, and also takes over the whole machine), but we also aren't just normal GUI apps (that can tolerate long losses of cpu time). We sort of have to compromise.

Let me lay out a few principles of what I believe is good threading for games, and why :

1. Make threads go to sleep when they have nothing to do (eg. don't spin). A true real-time or high-performance app might have worker threads that just spin on popping work items off a queue. This gives them guaranteed few-hundred-clock latencies. This is not appropriate for games. You are pretty much always running on a system that you need to share. Obviously on the PC there may be other apps running, but even on consoles there are system threads that you need to share with and don't have full control over. The best time to give up the CPU is when your threads are idle, so make your threads really go to sleep when they are idle.

2. Never use Sleep(n). Not ever. Not even once. Not even in your spin-backoff loops. In literature you will often see people talk of exponential backoffs and such. Not for games.

The issue is that even a Sleep(1) is 1 milli. 1 milli is *forever* in a game. If the sleep is on the critical path to getting a frame out, you can add 1 milli to your frame time. If you hit another sleep along the critical path you have another millisecond. It's no good.

3. All Waits should be on events/semaphores/what have you (OS waitable handles) so that sleeping threads are woken up when they can proceed, not on a timer.

4. Thread switches should be your #1 concern. (other than deadlocks and races and such things that are just bugs). Taking a mutex is 100 clocks or so. Switching threads is 10,000 clocks or so. It's very very very bad. The next couple of points are related to this :

5. "Spurious" or "polling" wakeups are performance disasters. If a thread has to wake up, check a condition, and then goes back to sleep, you just completely wasted 10,000 clocks. Ensure that threads are waiting on a specific condition, and they are only woken up when that condition is true. This can be a bit tricky when you need to wait on multiple conditions (and don't have Win32's nice WaitForMultiple to help you).

6. You should sacrifice some micro-efficiency to reduce thread thrashing. For example, consider the blocking SPSC queue using a semaphore that we talked about recently. If you are using that to push work items that take very little time, you can have the following pattern :

  main thread pushes work
  worker wakes up , pops work, does it, sees nothing available, goes back to sleep
  main thread pushes work
  worker wakes up , pops work, does it, sees nothing available, goes back to sleep
constantly switching. Big disaster. There are a few solutions. The best in this case would be to detect that the work item was very trivial and just do it immediately on the main thread rather than pushing it to a worker. Another would be to batch up a packet of work and send all of them at once instead of posting the semaphore each time. Another is to play with priorities - bump up your priority while you are making work items, and then bump it back down when you want them all to fire.

Basically doing some more work and checks that reduce your max throughput but avoids some bad cases is the way to go.

7. Mutexes can cause bad thread thrashing. The problem with mutexes (critical sections) is not that they are "slow" (oh no, 100-200 clocks big whoop) in the un-contended case, it's what they do under contention. They block the thread and swap it out. That's fine if that's actually what you want to do, but usually it's not.

Consider for example a queue that's mutex protected. There are a bunch of bad articles about the performance of mutex-protected queues vs. lock-free queues. That completely misses the point. The problem with a mutex-protected queue is that when you try to push an item and hit contention, your thread swaps out. Then maybe the popper gets an item and your thread swaps back in. Thread swapping in & out just to push an item is terrible. (obviously mutexes that spin before swapping try to prevent this, but they don't prevent the worst case, which means you can have unpredictable performance, and in games the most important thing is generally the worst case - you want to minimize your maximum frame time)

Basically, mutexing is not a good reason to swap threads. You should swap threads when you are out of work to do, not just to let someone get access to a blocked shared variable, and then you'll run some more when they're done.

8. Over-subscription is bad. Over-subscription (having more threads than cores) inherently creates thread switching. This is okay if the threads are waiting on non-computational signals (eg. IO or network), but otherwise it's unnecessary. Some people think it's elegant to make a thread per operation type, eg. a ray cast thread, a collision thread, a particle thread, etc. but in reality all that does is create thread switch costs that you don't need to have.

In video games, you basically don't want the time-slicing aspect of threads to ever happen (in the absence of other apps). You want a thread to run until it has to wait for something. Over-subscribing will just cause time-slice thread switching that you don't need. You don't want your render thread and your physics thread to time-slice and swap back and forth on one core. If they need to share a core, you want all your physics work to run, then all your render work to run, sequentially.

9. Generally a thread per task is bad. Prefer a thread per core. Make a task system that can run on any thread. You can then manually sequence tasks on the cores by assigning them to threads. So in the example above instead have a physics task and a render task, which might be assigned to different cores, or might be assigned to one thread on one core and run sequentially.

7/10/2011

07-10-11 - Mystery - Do you ever need Total Order (seq_cst) ?

The seq_cst memory model ops in C++0x are very mysterious to me. In particular, when exactly do you need them and why? Let's talk about some issues.

First of all, one quirk is that C++0x has no way of expressing a #StoreLoad barrier. Because of that, you often wind up having to use seq_cst mem ops just to get a StoreLoad. (for example in "eventcount" seq_cst is needed, but that's just to get the StoreLoad). But this is rather unfortunate because seq_cst is actually much heavier than StoreLoad. It also enforces "total order".

( there are some more quirks in the details of C++0x ; for example an acq_rel or seq_cst is really only the barrier you think it is if it's an "exchange" (read and write). That is, a store marked acq_rel is not actually acq_rel. Similarly, seq_cst fences don't act like you think they do (see Dmitry's posts on this); in particular you cannot just translate a #StoreLoad from another language into a fence(seq_cst) in C++0x (UPDATE : well, usually you can); also seq_cst in C++0x only schedules against other atomic ops with ordering constraints; they do not schedule against non-atomic memory loads, so in that sense it is not like a memory barrier at all ) (UPDATE : this is basically true, but see clarification in newer posts here : cbloom rants 05-31-12 - On C++ Atomic Fences Part 2 and cbloom rants 05-30-12 - On C++ Atomic Fences ).

Let's compare against a theoretical "all_barriers" op, which has LoadLoad,StoreStore,StoreLoad, and LoadStore barriers on both sides of it. seq_cst is the same thing, but also enforces total order. Why would we ever want that?

What does total order do? Well the classic toy example is something like this :


int a,b,c,d = 0,0,0,0;

thread1 : 
  a = 1;
  b = 1;

thread2 :
  c = 1;
  d = 1;

thread3 :
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

thread4:
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

if all the ops are "all_barriers" then we know that the write to b must be after the write to a, and the write to d must be after the write to c, but the ops to {ab} can schedule against the ops to {cd} in any way - and in particular thread 3 and 4 can see them in different ways. So for example with all_barriers, this is a valid output :

thread3 :
 sees {abcd}
  1,0,0,0

thread4 :
 sees {cdab}
  0,0,1,1

if you use seq_cst the order of the stores on the bus is a single linear order. eg. maybe it's {acbd} or whatever; the point is it's the same for all observers. In particular, if we pretend that thread3 and 4 can run instantly and atomically then they would print the same thing.

ADDENDUM : An actual working demonstration goes like this :

shared int a,b = 0,0;
result int c,d = 0,0;

thread1 :
  a = 1;

thread2 : 
  b = 1;

thread3 :
  local int A = a;
  local int B = b;
  if ( A == 1 && B == 0 )
    c = 1;

thread4 :
  local int B = b;
  local int A = a;
  if ( B == 1 && A == 0 )
    d = 1;

// invariant :

if memory order is seq_cst then (c+d) == 2 is impossible
any other memory order and (c+d) == 2 is possible

that is, threads 1 & 2 independently write to a & b. If you use a total order, then thread 3 and 4 must see either {ab} or {ba} - both are valid, but the point is it's the same. If you don't use total order then they could see the order differently.

We then check the order in a race-free way. c=1 can only be set if the order was {ab} , d=1 can only be set if the order was {ba} , therefore with seq_cst it's impossible to see both c=1 and d=1.

(end of addendum).

(note that if you only talk about two threads, then this "total order" issue never comes up, and acq_rel is the same as seq_cst; it only becomes different when three or more threads are watching each other)

But this is quite a weird thing to require; does it ever actually matter?

(just to confuse things, Intel recently changed the spec of the MFENCE instruction to enforce "causal consistency" instead of "sequential consistency" ; presumably this is part of the push to higher core counts and the goal is to get away from the very expensive system-wide sequence point that sequential consistency requires. "causal consistency" provides a total order only for operations which are "causally related" - eg. at the same memory location, or tied together by being used together in an atomic op).

(briefly : in C++0x a seq_cst fence acts to strictly order only other seq_cst ops; it also acts as an acq_rel fence; it does not act to order ops that are not seq_cst ; for example it is not a #StoreStore barrier for two relaxed stores, one before and one after the fence (unless they are seq_cst themselves) ; part of the confusion comes from the fact that in practice on most platforms a seq_cst fence is implemented with an instruction like MFENCE which is in fact a stronger barrier - it orders all ops as being strictly before the mfence or strictly after).

(as usual the C++ language lawyer guys are spiraling off into a pit of abstraction and mildly fucking up; I'll do a followup post on how it should have been done).

Dekker's mutex is sometimes cited as an example that needs sequential consistency, because it is doing that sort of weird stuff where it uses multiple variables and needs the writes to them to go in the right order, but it actually doesn't need seq_cst. All it needs are #StoreLoad barriers. ( see here - though I think this code is actually broken)

Unfortunately StoreLoad is very awkward to express in C++0x ; one way to do it is to change the load into an AtomicIncrement by 0, so that it's a read-modify-write (RMW), and then you can get away with a #StoreStore (which is a normal "release" constraint). For example Dekker's lock can be expressed like this :


    void lock(int tid)
    {
        int him = tid^1;
        rl::linear_backoff b1;
        
        flag[tid]($).store(1,std::mo_relaxed);

        // #StoreLoad here

        while ( flag[him]($).fetch_add(0,std::mo_acq_rel) )
        {       
            b1.yield($);
            if ( turn($).load(std::mo_relaxed) != tid )
            {
                rl::linear_backoff b2;
                flag[tid]($).store(0,std::mo_relaxed);
                while ( turn($).load(std::mo_relaxed) != tid )
                {
                    b2.yield($);
                }   
                flag[tid]($).store(1,std::mo_relaxed);

                //#StoreLoad here
            }
        }
    }
    
    void unlock(int tid)
    {
        turn($).store( tid^1  , std::mo_release);
        flag[tid]($).store( 0 , std::mo_release);
    }

(backoffs are needed to make Relacy work). Now obviously the RMW is an unfortunate stupid expense, it requires a chunk of CPU time to hold that cache line, but it might be better than a seq_cst op, depending on how that's implemented.

So what I'm struggling with is imagining a situation that actually needs "total order" (and "needs" in a fundamental sense, not just because the coder was a dummy doing silly things). That is, you can easily cook up bad code that needs total order because it is making dumb assumptions about disparate memory ops becoming visible in the same order on different processors, but that's just bad code.

7/09/2011

07-09-11 - TLS for Win32

So as noted in previous discussion of TLS , there's this annoying problem that it's broken for DLL's in Win32 (pre-Vista).

The real annoyance as a library writer is that even if you ship a .lib, somebody might want to use your lib in a DLL, and you can't know that in advance (I guess you could build a separate version of your lib for people who want to put it in a DLL). Even if you do make a DLL version, it's annoying for the client to have to hook into your DLL_PROCESS_ATTACH to set up your TLS (if you want to use the MS-recommended way of doing TLS in DLLs). It just doesn't work very well for modular code components. The result is that if you are writing code that is supposed to always work on Win32, you have to do your own TLS system.

(same is true for Xenon XEX's and maybe PS3 PRX's though I'm not completely sure about that; I'm not aware of any other platforms that are broken like this, but there probably are some).

Anyway, so you want TLS but you can't use the compiler's built-in "__thread" mechanism. You can do something like this :



#define TLSVAR_USE_CINIT
//#define TLSVAR_USE_COMPILER_TLS

// T has to be castable to void *
template <typename T>
struct TLSVar
{
public:

    // shared between threads :
    uint32 m_tls_index;
        
    // AllocIndex is thread-safe
    // it does wait-free speculative singleton construction
    static uint32 AllocIndex(volatile uint32 * pSharedIndex)
    {
        uint32 index = LoadRelaxed(pSharedIndex);
        if ( index == TLS_OUT_OF_INDEXES )
        {
            index = TlsAlloc();
            // store my index :
            uint32 oldVal = AtomicCMPX32(pSharedIndex,TLS_OUT_OF_INDEXES,index);
            if ( oldVal != TLS_OUT_OF_INDEXES )
            {
                // already one in there
                TlsFree(index);
                index = oldVal;
            }
        }
        return index;
    }

    #ifdef TLSVAR_USE_CINIT
    TLSVar() : m_tls_index(TLS_OUT_OF_INDEXES)
    {
        AllocIndex(&m_tls_index);
    }
    #endif
    
    T & Ref()
    {
        #ifndef TLSVAR_USE_CINIT
        AllocIndex(&m_tls_index);
        #endif
        
        // initial value in TLS slot :
        //  this has to be done once per thread
        LPVOID tls = TlsGetValue(m_tls_index);
        if ( tls == NULL )
        {
            T * pT = new T;
            tls = (LPVOID)pT;
            TlsSetValue(m_tls_index,tls);
        }
        
        return *((T *) tls);
    }
    
    operator T & ()
    {
        return Ref();
    }
    
    void operator = (const T t)
    {
        Ref() = t;
    }
    
};

#ifdef TLSVAR_USE_COMPILER_TLS

#ifdef _MSC_VER
#define TLS_VAR(type,name)  __declspec(thread) type name = (type)0;
#else
#define TLS_VAR(type,name)  __thread type name = (type)0;
#endif

#else // TLSVAR_USE_COMPILER_TLS

#ifdef TLSVAR_USE_CINIT
#define TLS_VAR(type,name) TLSVar<type> name;
#else
// use static initializer, not cinit :
#define TLS_VAR(type,name) TLSVar<type> name = { TLS_OUT_OF_INDEXES };
#endif

#endif // TLSVAR_USE_COMPILER_TLS


A few notes :

I made it able to work with cinit or without. The cinit version is somewhat preferable. I'm not sure if cinit always works on all platforms with modular code loading, so I made it optional.

AllocIndex uses the preferred modern way of instantiating shared singletons. It is "wait free", which means all threads always make progress in bounded time. In the case of contention there is an unnecessary alloc and free, but contention is unlikely and the waste is usually not a big deal. Whenever an extra alloc/free is not a big deal, this is the best way to do a singleton. If the extra alloc/free is a big deal, then blocking is preferred.

Some platforms have a small limit on the number of TLS slots. If you use the compiler __thread mechanism, all your TLS variables get put together in a struct that goes in one TLS slot. If you can't use that, then it's probably best to do the same thing by hand - make a struct that contains everything you want to be thread-local and then just use a single slot for the struct. Unfortunately this is ugly for software engineering as many disparate systems might want to use TLS and they all have to share a globally visible struct def.

Handling freeing at thread shutdown is an annoyance. The pthreads tls mechanism lets you register a function callback for each tls slot which can do freeing at thread shutdown. I'm sure there's some way to get a thread-shutdown callback in Windows. Personally I prefer to use a model where all my threads live for the lifetime of the app (there are no short-lifetime threads), so I just don't give a shit about cleaning up the TLS, but that may not be acceptable to everyone, so you will have to deal with this.
