7/14/2011

07-14-11 - compare_exchange_strong vs compare_exchange_weak

The C++0x standard was revised a while ago to split compare_exchange (aka CAS) into two ops. A quick note on the difference :

bool compare_exchange_weak( T * ptr, T * old, T new );

bool compare_exchange_strong( T * ptr, T * old, T new );

(BTW in the standard "old" is actually a reference, which is a damn shitty thing to do because it makes it very non-transparent that "old" gets mutated by these functions, so I am showing it as a pointer).

both try to do :


atomically {
  if ( *ptr == *old ) { *ptr = new; return true; }
  else { *old = *ptr; return false; }
}

the difference is that compare_exchange_weak can also return false for spurious failure. (the original C++0x definition of CAS always allowed spurious failure; the new thing is the _strong version which doesn't).

If it returns due to spurious failure, then *old might be left untouched (and in fact, *ptr might be equal to *old but we failed anyway).

If spurious failure can only occur due to contention, then you can still gaurantee progress. In fact in the real world, I believe that LL-SC architectures cannot gaurantee progress, because you can get spurious failure if there is contention anywhere on the cache line, and you need that contention to be specifically on your atomic variable to gaurantee progress. (I guess if you are really worried about this, then you should ensure that atomic variables are padded so they get their own cache line, which is generally good practice for performance reasons anyway).

On "cache line lock" type architectures like x86, there is no such thing as spurious failure. compare_exchange just maps to "cmpxchg" instruction and you always get the swap that you want. (it can still fail of course, if the value was not equal to the old value, but it will reload old). (BTW it's likely that x86 will move away from this in the future, because it's very expensive for very high core counts)

compare_exchange_weak exists for LL-SC (load linked/store conditional) type architectures (Power, ARM, basically everything except x86), because on them compare_exchange_strong must be implemented as a loop, while compare_exchange_weak can be non-looping. For example :

On ARM, compare_exchange_weak is something like :

compare_exchange_weak:

  ldrex     // load with reservation
  teq       // test equality
  strexeq   // store if equal
and strexeq can fail for two reasons - either because they weren't equal, or because the reservation was lost (because someone else touched our cache line).

To implement compare_exchange_strong you need a loop :

compare_exchange_strong:

  while ( ! compare_exchange_weak(ptr,old,new) ) { }

(note that you might be tempted to put a (*old = *ptr) inside the loop, but that's probably not a good idea, and not necessary, because compare_exchange_weak will eventually load *ptr into *old itself when it doesn't fail spuriously).

The funny bit is that when you use compare_exchange you often loop anyway. For example say I want to use compare_exchange_strong to increment a value, I have to do :


cur = *ptr;
while( ! compare_exchange_strong(ptr,&cur,cur+1) ) { }

(note it's a little subtle that this works - when compare_exchange_strong fails, it's because somebody else touched *ptr, so we then reload cur (this is why "old" is passed by address), so you then recompute cur+1 from the new value; so with the compare_exchange_strong, cur has a different value each time around this loop.)

But on an LL-SC architecture like ARM this becomes a loop on a loop, which is dumb when you could get the same result with a single loop :


cur = *ptr;
while( ! compare_exchange_weak(ptr,&cur,cur+1) ) { }

Note that with this loop now cur does *not* always take a new value each time around the loop (it does when it fails due to contention, but not when it fails just due to reservation-lost), but the end result is the same.

So that's why compare_exchange_weak exists, but you might ask why compare_exchange_strong exists. If we always use loops like this, then there's no need for it. But we don't always use loops like this, or we might want to loop at the much higher level. For example you might have something like :

bool SpinLock_TryLock(int * lock)
{
  int zero = 0;
  return compare_exchange_strong(lock,&zero,1);
}
which returns false if it couldn't get the lock (and then might do an OS wait) - you don't want to return false just because of a spurious failure. (that's not a great example, maybe I'll think of a better one later).

(BTW I think the C++0x stuff is a little bit broken, like most of C standardization, because they are trying to straddle this middle ground of exposing the efficient hardware-specific ways of doing things, but they don't actually expose enough to map directly to the hardware, and they also aren't high level enough to separate you from knowing about the hardware. For example none of their memory model actually maps directly to what x86 provides, therefore there are some very efficient x86-specific ways to do threading ops that cannot be expressed portable in C++0x. Similarly on LL-SC architectures, it would be preferrable to just have access to LL-SC directly.

I'd rather see things in the standard like "if LL-SC exist on this architecture, then they can be invoked via __ll() and __sc()" ; more generally I wish C had more conditionals built into the language, that would be so much better for real portability, as opposed to the current mess where they pretend that the language is portable but it actually isn't so you have to create your own mess of meta-language through #defines).

07-14-11 - ARM Atomics

Some quick notes for reference in the future.

In general when I'm porting atomics to a new platform, what I would really like is a document from the hardware maker that describes their memory semantics and cache model in detail. Unfortunately that can be very hard to find. You also need to know how the compiler interacts with that memory model (eg. what does volatile mean, how do I generate compiler reorder barriers, etc). Again that can be very hard to find, particularly because so many compilers now are based on GCC and the GCC guys are generally stubborn punk-asses about clearly defining how they behave in the "undefined" parts of C which are so crucial to real code.

So assuming you can't find any decent documentation, the next place to look is some of the large cross-platform codebases. Probably the best one of all is Linux, because it's decently well tested. Unfortunately you can't just copy code from these since most of them are GPL, but you can use them as educational material to figure out the memory semantics of a platform. So some things we learn from Linux straight off :

1. Before ARMv6 you're fucked. There are no real atomic ops (SWP is not good enough) so you have to use some locking/critical-section mechanism to do atomics. Linux in the kernel does this by blocking interrupts, doing the atomic, then turning them back on. If you're not in the kernel, on Linux there's a secret syscall function pointer you can use, or non-Linux you have to use SWP to implement a spinlock which you then use to do CAS and such.

2. With ARMv6 you can use ldrex/strex , which seems to be your standard LL-SC kind of thing.

3. If you're SMP you need full memory barriers for memory ordering.

One thing I don't know is whether any of the Apple/Android consumer ARM multi-core chips are actually SMP ; eg. do they have separate caches, or are they shared single cache with multiple execution units?

Some reference I've found :

[pulseaudio-discuss] Atomic operations on ARM
Wandering Coder - ARM
QEmu - Commit - ViewGit
pulseaudio-discuss Atomic operations on ARM 1
Old Nabble - gcc - Dev - atomic accesses
Linux Kernel Locking Techniques
Linux Kernel ARM atopics
Debian -- Details of package libatomic-ops-dev in sid
Data alignment Straighten up and fly right
Broken ARM atomic ops wrt memory barriers (was [PATCH] Add cmpxchg support for ARMv6+ systems) - Patchwork
Au�ergew�hnlicher Migrationsdruck
Atomic - GCC Wiki
ARM Technical Resources
ARM Information Center

ARM RealView compiler has some interesting intrinsics that are not documented very well :


ARM RealView has :
    __force_stores
    __memory_changed
    __schedule_barrier
    __yield
    __strexeq

one trick in this kind of work is to find a compiler that has intrinsics you want and then just look at what assembly is generated so that you can see how to generate the op you want on that platform.

(but beware, because the intrinsics are not always correct; in particular the GCC __sync ops are not all right, sometimes have bugs, and sometimes their behavior is "correct" but doesn't match the documentation; it's very hard to find correct documentation on what memory semantics the GCC __sync ops actually gaurantee).

Anyway, maybe I'll update this when I get some more information / do some more research.

7/13/2011

07-13-11 - Good threading design for games

I want to back up and talk a bit about how threading should be used in modern games and what makes good or bad threading design in general.

Games are sort of a weird intermediate zone. We aren't quite what would be considered "real-time" apps (a true "real-time" app sets itself to the maximum possible priority and basically never allows any thread switches), nor are we quite "high performance computing" (which emphasizes throughput over latency, and also takes over the whole machine), but we also aren't just normal GUI apps (that can tolerate long losses of cpu time). We sort of have to compromise.

Let me lay out a few principals of what I believe is good threading for games, and why :

1. Make threads go to sleep when they have nothing to do (eg. don't spin). A true real-time or high-performance app might have worker threads that just spin on popping work items off a queue. This allows them to have few-hundred-clock latencies gauranteed. This is not appropriate for games. You are pretty much always running on a system that you need to share. Obviously on the PC they may have other apps running, but even on consoles there are system threads that you need to share with and don't have full control over. The best time to do this is when your threads are idle, so make your threads really go to sleep when they are idle.

2. Never use Sleep(n). Not ever. Not even once. Not even in your spin-backoff loops. In literature you will often see people talk of exponential backoffs and such. Not for games.

The issue is that even a Sleep(1) is 1 milli. 1 milli is *forever* in a game. If the sleep is on the critical path to getting a frame out, you can add 1 milli to your frame time. If you hit another sleep along the critical path you have another millisecond. It's no good.

3. All Waits should be on events/sempahores/what have you (OS waitable handles) so that sleeping threads are woken up when they can proceed, not on a timer.

4. Thread switches should be your #1 concern. (other than deadlocks and races and such things that are just bugs). Taking a mutex is 100 clocks or so. Switching threads is 10,000 clocks or so. It's very very very bad. The next couple of points are related to this :

5. "Spurious" or "polling" wakeups are performance disasters. If a thread has to wake up, check a condition, and then goes back to sleep, you just completely wasted 10,000 clocks. Ensure that threads are waiting on a specific condition, and they are only woken up when that condition is true. This can be a bit tricky when you need to wait on multiple conditions (and don't have Win32's nice WaitForMultiple to help you).

6. You should sacrifice some micro-efficiency to reduce thread thrashing. For example, consider the blocking SPCS queue using semaphore that we talked about recently. If you are using that to push work items that take very little time, you can have the following pattern :

  main thread pushes work
  worker wakes up , pops work, does it, sees nothing available, goes back to sleep
  main thread pushes work
  worker wakes up , pops work, does it, sees nothing available, goes back to sleep
constantly switching. Big disaster. There are a few solutions. The best in this case would be to detect that the work item was very trivial and just do it immediately on the main thread rather than pushing it to a worker. Another would be to batch up a packet of work and send all of them at once instead of posting the semaphore each time. Another is to play with priorities - bump up your priority while you are making work items, and then bump it back down when you want them all to fire.

Basically doing some more work and checks that reduce your max throughput but avoids some bad cases is the way to go.

7. Mutexes can cause bad thread thrashing. The problem with mutexes (critical sections) is not that they are "slow" (oh no, 100-200 clocks big whoop) in the un-contended case, it's what they do under contention. They block the thread and swap it out. That's fine if that's actually what you want to do, but usually it's not.

Consider for example a queue that's mutex protected. There are a bunch of bad articles about the performance of mutex-protected queues vs. lock-free queues. That completely misses the point. The problem with a mutex-protected queue is that when you try to push an item and hit contention, your thread swaps out. Then maybe the popper gets an item and your thread swaps back in. Thread swapping in & out just to push an item is terrible. (obviously mutexes that spin before swapping try to prevent this, but they don't prevent the worst case, which means you can have unpredictable performance, and in games the most important thing is generally the worst case - you want to minimize your maximum frame time)

Basically, mutexing is not a good reason to swap threads. You should swap threads when you are out of work to do, not just to let someone get access to a blocked shared variable, and then you'll run some more when they're done.

8. Over-subscription is bad. Over-subscription (having more threads than cores) inherently creates thread switching. This is okay if the threads are waiting on non-computational signals (eg. IO or network), but otherwise it's unnecessary. Some people think it's elegant to make a thread per operation type, eg. a ray cast thread, a collision thread, a particle thread, etc. but in reality all that does is create thread switch costs that you don't need to have.

In video games, you basically don't want the time-slicing aspect of threads to ever happen (in the absense of other apps). You want a thread to run until it has to wait for something. Over-subscribing will just cause time-slice thread switching that you don't need. You don't want your render thread and your physics thread to time-slice and swap back and forth on one core. If they need to share a core, you want all your physics work to run, then all your render work to run, sequentially.

9. Generally a thread per task is bad. Prefer a thread per core. Make a task system that can run on any thread. You can then manually sequence tasks on the cores by assigning them to threads. So in the example above instead have a physics task and a render task, which might be assigned to different cores, or might be assigned to one thread on one core and run sequentially.

7/10/2011

07-10-11 - Mystery - Do you ever need Total Order (seq_cst) -

The seq_cst memory model ops in C++0x are very mysterious to me. In particular, when exactly do you need them and why? Let's talk about some issues.

First of all, one quirk is that C++0x has no way of expressing a #StoreLoad barrier. Because of that, you often wind up having to use seq_cst mem ops just to get a StoreLoad. (for example in "eventcount" seq_cst is needed, but that's just to get the StoreLoad). But this is rather unfortunate because seq_cst is actually much heavier than StoreLoad. It also enforces "total order".

( there are some more quirks in the details of C++0x ; for example an acq_rel or seq_cst is really only the barrier you think it is if it's an "exchange" (read and write). That is, a store marked acq_rel is not actually acq_rel. Similarly, seq_cst fences don't act like you think they do (see Dmitry's posts on this); in particular you cannot just translate a #StoreLoad from another language into a fence(seq_cst) in C++0x (UPDATE : well, usually you can); also seq_cst in C++0x only schedules against other atomic ops with ordering constraints; they do not schedule against non-atomic memory loads, so in that sense it is not like a memory barrier at all ) (UPDATE : this is basically true, but see clarification in newer posts here : cbloom rants 05-31-12 - On C++ Atomic Fences Part 2 and cbloom rants 05-30-12 - On C++ Atomic Fences ).

Let's compare against a theoretical "all_barriers" op, which has LoadLoad,StoreStore,StoreLoad, and LoadStore barriers on both sides of it. seq_cst is the same thing, but also enforces total order. Why would we ever want that?

What does total order do? Well the classic toy example is something like this :


int a,b,c,d = 0,0,0,0;

thread1 : 
  a = 1;
  b = 1;

thread2 :
  c = 1;
  d = 1;

thread3 :
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

thread4:
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

if all the ops are "all_barriers" then we know that the write to b must be after the write to a, and the write to d must be after the write to c, but the ops to {ab} can schedule against the ops to {cd} in any way - and in particular thread 3 and 4 can see them in different ways. So for example with all_barriers, this is a valid output :

thread3 :
 sees {abcd}
  1,0,0,0

thread4 :
 sees {cdab}
  0,0,1,1

if you use seq_cst the order of the stores on the bus is a single linear order. eg. maybe it's {acbd} or whatever, the point is its the same for all observers. In particular, if we pretend that thread3 and 4 can run instantly and atomically then they would print the same thing.

ADDENDUM : An actual working demonstration goes like this :

shared int a,b = 0,0;
result int c,d = 0,0;

thread1 :
  a = 1;

thread2 : 
  b = 1;

thread3 :
  local int A = a;
  local int B = b;
  if ( A == 1 && B == 0 )
    c = 1;

thread4 :
  local int B = b;
  local int A = a;
  if ( B == 1 && A == 0 )
    d = 1;

// invariant :

if memory order is seq_cst then (c+d) == 2 is impossible
any other memory order and (c+d) == 2 is possible

that is, threads 1 & 2 independently write to a & b. If you use a total order, then thread 3 and 4 must see either {ab} or {ba} - both are valid, but the point is it's the same. If you don't use total order then they could see the order differently.

We then check the order in a race-free way. c=1 can only be set if the order was {ab} , d=1 can only be set if the order was {ba} , therefore with seq_cst it's impossible to see both c=1 and d=1.

(end of addendum).

(note that if you only talk about two threads, then this "total order" issue never comes up, and acq_rel is the same as seq_cst; it only becomes different when three or more threads are watching each other)

But this is quite a weird thing to require; does it ever actually matter?

(just to confuse things, Intel recently changed the spec of the MFENCE instruction to enfore "causal consistency" instead of "sequential consistency" ; presumably this is part of the push to higher core counts and the goal is to get away from the very expensive system-wide sequence point that sequential consistency requires. "causal consistency" provides a total order only for operations which are "causally related" - eg. at the same memory location, or tied together by being used together in an atomic op).

(briefly : in C++0x a seq_cst fence acts to strictly order only other seq_cst ops; it also acts as acq_rel fence; it does not act to order ops that are not seq_cst ; for example it is not a #StoreStore barrier for two relaxes stores, one before and one after the fence (unless they are seq_cst themselves) ; part of the confusion comes from the fact that in practice on most platforms a seq_cst fence is implemented with an instruction like MFENCE which is in fact a stronger barrier - it orders all ops as being strictly before the mfence or strictly after).

(as usual the C++ language lawyer guys are spiraling off into a pit of abstraction and mildly fucking up; I'll do a followup post on how it should have been done).

Dekker's mutex is sometimes cited as an example that needs sequential consistency, because it is doing that sort of weird stuff where it uses multiple variables and needs the writes to them to go in the right order, but it actually doesn't need seq_cst. All it needs are #StoreLoad barriers. ( see here - though I think this code is actually broken)

Unfortunately StoreLoad is very awkward to express in C++0x ; one way to do it is to change the load into an AtomicIncrement by 0, so that it's a read-modify-write (RMW), then you can use a #StoreStore (which a normal "release" constraint). For example Dekker's lock can be expressed like this :


    void lock(int tid)
    {
        int him = tid^1;
        rl::linear_backoff b1;
        
        flag[tid]($).store(1,std::mo_relaxed);

        // #StoreLoad here

        while ( flag[him]($).fetch_add(0,std::mo_acq_rel) )
        {       
            b1.yield($);
            if ( turn($).load(std::mo_relaxed) != tid )
            {
                rl::linear_backoff b2;
                flag[tid]($).store(0,std::mo_relaxed);
                while ( turn($).load(std::mo_relaxed) != tid )
                {
                    b2.yield($);
                }   
                flag[tid]($).store(1,std::mo_relaxed);

                //#StoreLoad here
            }
        }
    }
    
    void unlock(int tid)
    {
        turn($).store( tid^1  , std::mo_release);
        flag[tid]($).store( 0 , std::mo_release);
    }

(backoffs are needed to make Relacy work). Now obviously the RMW is an unfortunate stupid expense, it requires a chunk of CPU time to hold that cache line, but it might be better than a seq_cst op, depending on how that's implemented.

So what I'm struggling with is imagining a situation that actually needs "total order" (and "needs" in a fundamental sense, not just because the coder was a dummy doing silly things). That is, you can easily cook up bad code that needs total order because it is making dumb assumptions about disparate memory ops becoming visible in the same order on different processors, but that's just bad code.

7/09/2011

07-09-11 - TLS for Win32

So as noted in previous discussion of TLS , there's this annoying problem that it's broken for DLL's in Win32 (pre-Vista).

The real annoyance as a library writer is that even if you compile a .lib, somebody might want to use your lib in a DLL, and you can't know that in advance (I guess you could build a separate version of your lib for people who want to put it in a DLL), and even if you do make a DLL version it's annoying to the client to have to hook yourself up to the DLL_PROCESS_ATTACH to set up your TLS (if you want to use the MS-recommended way of doing TLS in DLLs). It just doesn't work very well for modular code components. The result is that if you are writing code that is supposed to always work on Win32 you have to do your own TLS system.

(same is true for Xenon XEX's and maybe PS3 PRX's though I'm not completely sure about that; I'm not aware of any other platforms that are broken like this, but there probably are some).

Anyway, so you want TLS but you can't use the compiler's built-in "__thread" mechanism. You can do something like this :



#define TLSVAR_USE_CINIT
//#define TLSVAR_USE_COMPILER_TLS

// T has to be castable to void *
template <typename T>
struct TLSVar
{
public:

    // shared between threads :
    uint32 m_tls_index;
        
    // AllocIndex is thread-safe
    // it does wait-free speculative singleton construction
    static uint32 AllocIndex(volatile uint32 * pSharedIndex)
    {
        uint32 index = LoadRelaxed(pSharedIndex);
        if ( index == TLS_OUT_OF_INDEXES )
        {
            index = TlsAlloc();
            // store my index :
            uint32 oldVal = AtomicCMPX32(pSharedIndex,TLS_OUT_OF_INDEXES,index);
            if ( oldVal != TLS_OUT_OF_INDEXES )
            {
                // already one in there
                TlsFree(index);
                index = oldVal;
            }
        }
        return index;
    }

    #ifdef TLSVAR_USE_CINIT
    TLSVar() : m_tls_index(TLS_OUT_OF_INDEXES)
    {
        AllocIndex(&m_tls_index);
    }
    #endif
    
    T & Ref()
    {
        #ifndef TLSVAR_USE_CINIT
        AllocIndex(&m_tls_index);
        #endif
        
        // initial value in TLS slot :
        //  this has to be done once per thread
        LPVOID tls = TlsGetValue(m_tls_index);
        if ( tls == NULL )
        {
            T * pT = new T;
            tls = (LPVOID)pT;
            TlsSetValue(m_tls_index,tls);
        }
        
        return *((T *) tls);
    }
    
    operator T & ()
    {
        return Ref();
    }
    
    void operator = (const T t)
    {
        Ref() = t;
    }
    
};

#ifdef TLSVAR_USE_COMPILER_TLS

#ifdef _MSC_VER
#define TLS_VAR(type,name)  __declspec(thread) type name = (type)0;
#else
#define TLS_VAR(type,name)  __thread type name = (type)0;
#endif

#else // TLSVAR_USE_COMPILER_TLS

#ifdef TLSVAR_USE_CINIT
#define TLS_VAR(type,name) TLSVar<type> name;
#else
// use static initializer, not cinit :
#define TLS_VAR(type,name) TLSVar<type> name = { TLS_OUT_OF_INDEXES };
#endif

#endif // TLSVAR_USE_COMPILER_TLS


A few notes :

I made it able to work with cinit or without. The cinit version is somewhat preferrable. I'm not sure if cinit always works on all platforms with modular code loading, so I made it optional.

AllocIndex uses the preferred modern way of instantiating shared singletons. It is "wait free", which means all threads always makes progress in bounded time. In the case of contention there is an unnecessary alloc and free, which is unlikely and usually not a big deal. Whenever an extra alloc/free is not a big deal, this is the best way to do a singleton. If the extra alloc/free is a big deal, then a block is preferred.

Some platforms have a small limit on the number of TLS slots. If you use the compiler __thread mechanism, all your TLS variables get put together in a struct that goes in one TLS slot. If you can't use that, then it's probably best to do the same thing by hand - make a struct that contains everything you want to be thread-local and then just use a single slot for the struct. Unfortunately this is ugly for software engineering as many disparate systems might want to use TLS and they all have to share a globally visible struct def.

Handling freeing at thread shutdown is an annoyance. The pthreads tls mechanism lets you register a function callback for each tls slot which can do freeing at thread shutdown. I'm sure there's some way to get a thread-shutdown callback in Windows. Personally I prefer to use a model where all my threads live for the lifetime of the app (there are no short-lifetime threads), so I just don't give a shit about cleaning up the TLS, but that may not be acceptible to everyone, so you will have to deal with this.

07-09-11 - LockFree - Thomasson's simple MPMC

Warming back up into this stuff, here's some very simple algos to study.

first fastsemaphore :


class fastsemaphore
{
    rl::atomic<long> m_count;
    rl::rl_sem_t m_waitset;

public:
    fastsemaphore(long count = 0)
    :   m_count(count)
    {
        RL_ASSERT(count > -1);
        sem_init(&m_waitset, 0, 0);
    }

    ~fastsemaphore()
    {
        sem_destroy(&m_waitset);
    }

public:
    void post()
    {
        if (m_count($).fetch_add(1) < 0)
        {
            sem_post(&m_waitset);
        }
    }

    void wait()
    {
        if (m_count($).fetch_add(-1) < 1)
        {
            // loop because sem_wait returns non-zero for spurious failure
            while (sem_wait(&m_waitset));
        }
    }
};

Most code I post will be in Relacy notation, which is just modified C++0x. Note that C++0x atomics without explicit memory ordering specifications (such as the fetch_adds here) default to memory_order_seq_cst (sequential consistency).

Basically your typical OS "semaphore" is a very heavy kernel-space object (on Win32 for example, semaphores are cross-process). Just doing P or V on it even when you don't modify wait states is very expensive. This is just a user-space wrapper which only calls to the kernel semaphore if it is at an edge transition that will cause a thread to either go to sleep or wake up.

So this is a simple thing that's nice to have. Note that m_count is always 0 or negative. If it's negative it's the (minus) number of threads that are sleeping on that semaphore. (between post and wakeup a thread can be sleeping but no longer counted, so we should say that threads which are not in minus m_count are either running or pending-running).

( see here )

Now we can look at Thomasson's very simple MPMC bounded blocking queue :


template<typename T, std::size_t T_depth>
class mpmcq
{
    rl::atomic<T*> m_slots[T_depth];
    rl::atomic<std::size_t> m_push_idx;
    rl::atomic<std::size_t> m_pop_idx;
    fastsemaphore m_push_sem;
    fastsemaphore m_pop_sem;

public:
    mpmcq() 
    :   m_push_idx(T_depth), 
        m_pop_idx(0),
        m_push_sem(T_depth),
        m_pop_sem(0)
    {
        for (std::size_t i = 0; i < T_depth; ++i)
        {
            m_slots[i]($).store(NULL);
        }
    }

public:
    void push(T* ptr)
    {
        m_push_sem.wait();

        std::size_t idx = m_push_idx($).fetch_add(1) & (T_depth - 1);

        rl::backoff backoff;

        while (m_slots[idx]($).load())
        {
            backoff.yield($);
        }

        RL_ASSERT(! m_slots[idx]($).load());

        m_slots[idx]($).store(ptr);

        m_pop_sem.post();
    }


    T* pop()
    {
        m_pop_sem.wait();

        std::size_t idx = m_pop_idx($).fetch_add(1) & (T_depth - 1);

        T* ptr;
        rl::backoff backoff;

        while ( (ptr = m_slots[idx]($).load()) == NULL )
        {
            backoff.yield($);
        }
        
        m_slots[idx]($).store(NULL);

        m_push_sem.post();

        return ptr;
    }
};

First let's understand what's going on here. It's just an array of slots with a reader index and writer index that loop around. "pop_sem" counts the number of filled slots - so the popper waits on that semaphore to see filled slots be non-zero. "push_sem" counts the number of available slots - so the pusher waits on that being greater than zero to be able to fill a slot.

So the producer and consumer both nicely go to sleep and wake each other when they should. Also because we use "fastsemaphore" they have reasonably low overhead when they are in the non-sleeping case.

Now, why is the weird backoff logic there? It's because of the "M" (for multiple) in MPMC. If this was an SPSC queue then it could be much simpler :


    void push(T* ptr)
    {
        m_push_sem.wait();

        std::size_t idx = m_push_idx($).fetch_add(1) & (T_depth - 1);

        RL_ASSERT(! m_slots[idx]($).load());

        m_slots[idx]($).store(ptr);

        m_pop_sem.post();
    }


    T* pop()
    {
        m_pop_sem.wait();

        std::size_t idx = m_pop_idx($).fetch_add(1) & (T_depth - 1);

        /* (*1) */

        T* ptr = m_slots[idx]($).exchange(NULL);
        RL_ASSERT( ptr != NULL );

        m_push_sem.post();

        return ptr;
    }

which should be pretty obviously correct for SPSC.

But, now consider you have multiple consumers and the queue is completely full.

Consumer 1 gets a pop_idx = 2. But then at (*1) it swaps out and doesn't run any more.

Consumer 2 gets a pop_idx = 3 and runs through and posts to the push semaphore.

Now a producer runs and gets push_idx = 2. It believes there is an empty slot it can write to, but it looks in slot 2 and there's still something there (because consumer 1 hasn't cleared it's slot yet). So, it has to do the backoff-yield loop to give consumer 1 some CPU time to let it run.

So the MPMC with backoff-yield works, but it's not great. As long as the queue is near empty it works reasonably well, but when it's full it acts like a mutex-based queue, in that one consumer being swapped out can block all your pushers from running (and because it's just a busy wait here, the normal OS hacks to rescue you (like Windows priority boosts) won't work here (this kind of thing is exactly why the Windows scheduler has so many hacks and why despite your whining you really do want it to be like that)).

7/08/2011

07-08-11 - Who ordered Event Count -

Say you have something like a lock-free queue , and you wish to go into a proper OS-level thread sleep when it's empty (eg. not just busy wait).

Obviously you try something like :


popper :

item = queue.pop();
if ( item == NULL )
{
  Wait(handle);
}


pusher :

was_empty = queue.push();
if ( was_empty )
{
  Signal(handle);
}

where we have extended queue.push to atomically check if the queue was empty and return a bool (this is easy to do for most lock-free queue implementations).

This doesn't work. The problem is that between the pop() and the Wait(), somebody could push something on the queue. That means the queue is non-empty at that point, but you go to sleep anyway, and nobody will ever wake you up.

Now, one obvious solution is to put a Mutex around the whole operation and use a "Condition Variable" (which is just a way of associating a sleep/wakeup with a mutex). That way the mutex prevents the state of the queue from changing while you decide to go to sleep. But we don't want to do that today because the whole point of our lock-free queue is that it's lock-free, and a mutex would spoil that. (actually I suppose the classic solution to this problem would be to use a counting semaphore and inc it on each push and dec it on each pop, but that has even more overhead if it's a kernel semaphore). Basically we want these specific fast paths and slow paths :


fast path :
  when popper can get an item without sleeping
  when pusher adds an item and there are no waiters

slow path :
  when popper has no work to do and goes to sleep
  when pusher needs to wake a popper

we're okay with the sleep/wakeup part being slow because that involves thread transitions anyway, so it's always slow and complicated. But in the other case where a core is just sending a message to another running core it should be mutex-free.

So, the obvious answer is to do a kind of double-check, something like :


popper :

item = queue.pop(); 
if ( item ) return item;  // *1
atomic_waiters ++;   // *2
item = queue.pop();  // *3
if ( item ) { atomic_waiters--; return item; }
if ( item == NULL )
{
  Wait(handle);
}
atomic_waiters--;


pusher :

queue.push();
if ( atomic_waiters > 0 )
{
  Signal(handle);
}

this gets the basics right. First popper has a fast path at *1 - if it gets an item, it's done. It then registers the fact that it's going to wait to a shared variable at *2. Then it double checks at *3. The double check is not an optimization to avoid sleeping your thread, it's crucial for correctness. The issue is that the popper could swap out between *1 and *2, and the pusher could then run completely, and it will see waiters == 0 and not do a signal. So the double-check at *3 catches this.

There's a performance bug with this code as written - if the queue goes empty then you do lots of pushes, all those pushes send the signal. You might be tempted to fix that by moving the "atomics_waiters--" line to the pusher side (before the signal), but that creates a race. You could fix that but then you spot a bigger problem :

This code doesn't work at all if you have multiple pushers or poppers. The problem is "lost wakeups". Basically if there are multiple poppers going into wait at the same time, the pusher may think it's done the wakeups it needs to, but it hasn't, and a popper goes into a wait forever.

To fix this you need a real proper "event count". What a proper event_count does is register the waiter at a certain point in time. The usage is like this :


popper :

item = queue.pop(); 
if ( item ) return item;
count = event_count.get_event_count();
item = queue.pop();
if ( item ) { event_count.cancel_wait(count); return item; }
if ( item == NULL )
{
  event_count.wait_on_count(count);
}


pusher :

queue.push();
event_count.signal();

Now, as before get_event_count() registers in an atomic visible variable that I want to wait on something (most people call this function prepare_wait()), but it also records the current "event count" to identify the wait (this is just the number of pushes, or the number of signals if you like). Then wait_on_count() only actually does the wait if the event_count is still on the same count as when I did get_wait_count - if the internal counter has advanced the wait is not done. signal() is a fast nop if there are no waiters, and increments the internal count.

This eliminates the lost wakeup problem, because if the "event_count" has advanced (and signaled some other popper, and won't signal again) then you will simply not go into the Wait.

Basically it's exactly like a Windows Event in that you can wait and signal, but with the added feature that you can record a place in time on the event, and then only do the Wait if that time has not advanced between your recording and the call to Wait.

It turns out that event_count and condition variables are closely related; in particular, one can very easily be implemented in terms of the other. (I should note that the exact semantics of pthread cond_var are *not* easy to implement in this way, but a "condition variable" in the broader sense need not comply with their exact specs).

Maybe in the future I'll get into how to implement event_count and condition_var.

BTW eventcount is the elegant solution to the problem of Waiting on Thread Events discussed previously.


ADDENDUM : if you like, eventcount is a way of doing Windows' PulseEvent correctly.

"Event" on Win32 is basically a "Gate" ; it's either open or closed. SetEvent makes it open. When you Wait() on the Event, if the gate is open you walk through, if it's closed you stop. The normal way to use it is with an auto-reset event, which means when you walk through the gate you close it behind yourself (atomically).

The idea of the Win32 PulseEvent API is to briefly open the gate and let through someone who was previously waiting on the gate, and then close it again. Unfortunately, PulseEvent is horribly broken by design and almost always causes a race, which leads most people to recommend against ever using PulseEvent (or manual reset events). (I'm sure it is possible to write correct code using PulseEvent, for example the race may be benign and just be a performance bug, but it is wise to follow this advice and not use it).

For example the standard Queue code using PulseEvent :

popper :
  node = queue.pop();
  if ( ! node ) Wait(event);

pusher :
  queue.push(node);
  PulseEvent(event);

is a totally broken race (if the popper is between getting a null node and enterring the wait, it doesn't see the event), and most PulseEvent code is similarly broken.

eventcount's Signal is just like PulseEvent - it only signals people who were previously waiting, it doesn't change the eventcount into a signalled state. But it doesn't suffer from the race of PulseEvent because it has a consistent way of defining the moment in "time" when the event fires.

07-08-11 - Event Count and Condition Variable

If you have either event_count or condition_variable, it's pretty straightforward to get the other from the one you have.

eventcount from condition_variable :

# by Chris M Thomasson
#
# class eventcount {  
# public:  
#   typedef unsigned long key_type;  
#   
#   
# private:  
#   mutable rl::atomic<key_type> m_count;  
#   rl::mutex m_mutex;  
#   rl::condition_variable m_cond;  
#   
#   
#   void prv_signal(key_type key) {  
#     if (key & 1) {  
#       m_mutex.lock($);  
#       while (! m_count($).compare_exchange_weak(key, (key + 2) & ~1,   
#         rl::memory_order_seq_cst));  
#       m_mutex.unlock($);  
#       m_cond.notify_all($);  
#     }  
#   }  
#   
#   
# public:  
#   eventcount() {  
#     m_count($).store(0, rl::memory_order_relaxed);  
#   }  
#   
#   
# public:  
#   key_type get() const {   // aka prepare_wait
#     return m_count($).fetch_or(1, rl::memory_order_acquire);  
#   }  
#   
#   
#   void signal() {  // aka notify_one
#     prv_signal(m_count($).fetch_add(0, rl::memory_order_seq_cst));  
#   }  
#   
#   
#   void signal_relaxed() {  
#     prv_signal(m_count($).load(rl::memory_order_relaxed));  
#   }  
#   
#   
#   void wait(key_type cmp) {  
#     m_mutex.lock($);  
#     if ((m_count($).load(rl::memory_order_seq_cst) & ~1) == (cmp & ~1)) {  
#       m_cond.wait(m_mutex, $);  
#     }  
#     m_mutex.unlock($);  
#   }  
# };  
#
and condition variable from event count :
by Dmitry V'jukov :

   1. class condition_variable  
   2. {  
   3.     eventcount ec_;  
   4.   
   5. public:  
   6.     void wait(mutex& mtx)  
   7.     {  
   8.         int count = ec_.prepare_wait();  
   9.         mtx.unlock();  
  10.         ec_.wait(count);  
  11.         mtx.lock();  
  12.     }  
  13.   
  14.     void signal()  
  15.     {  
  16.         ec_.notify_one();  
  17.     }  
  18.   
  19.     void broadcast()  
  20.     {  
  21.         ec_.notify_all();  
  22.     }  
  23. };   
(note this is a simplified condition variable without all the POSIX compliance crud).

In C++0x you have condition_variable at the stdlib level, so that is probably the best approach for the future. Unfortunately that future is still far away. On Pthreads you also have condition_variable (though a rather more complex one). Unfortunately, on Win32 (pre-Vista) you don't have condition_varaiable at all, so you have to build one of these from OS primitives.

(BTW there are various sources for good condition_var implementations for Win32, such as boost::thread and Win32 pthreads by Alex Terekhov).

ADDENDUM : really eventcount is the more primitive of the two; it's sort of a mistake that C++0x has provided "condition_var" as a primitive. They are not trying to provide a full set of OS-level thread control types (eg. they don't provide semaphore, event, what have you) - they are trying to provide the minimal basic set, and they chose condition_var. They should have done mutex and eventcount, as you can build everything from that.

(actually there's something perhaps even more primitive that eventcount which is "waitset" which can be easily used to build any of the basic blocking thread control devices).

7/06/2011

07-06-11 - Who ordered Condition Variables -

I'm getting back into some low level threading stuff for a week or two and I'll try to write about it because it's very confusing and I always forget the basics.

Condition Variables are explained very strangely around the net. You'll find sites that say they "let you wait on a variable being set to certain value" (not true), "let you avoid polling" (not true), or "the mutex is a left-over from pre-pthreads implementations" (not true).

What a "Condition Variable" really is is a way to receive a Signal and enter a Mutex at the same time ("atomically" if you like).

Why do we want this?

The typical case is we are waiting on some state. When the state is not true, we want our thread to go into a real sleep. So we use an OS Wait() on the thread that's waiting for that state, and an OS Signal() when the state is set to wake the thread. (Wait and Signal might be "Event" on win32, or a semaphore in pthreads, or a futex, etc). Basically :


Waiting thread :

if ( state I want is not set )
{
  Wait(handle);
  // state I want should be set now
  // **


Signalling thread :

if ( I changed state to the one wanted )
  Signal(handle)

simple enough. The problem is, the invariant at (**) is not true. The state you want is NOT necessarily set there, because there's a race. Immediately after you receive the signal, you could be swapped out, and someone else could change the state, and then it would not be what you wanted.

So the obvious thing is to put a mutex around "state". To be less abstract I'll talk about a one item queue (aka a mailbox).


Waiting thread :

Lock mutex;
if ( mailbox empty )
{
  atomically { Unlock(mutex);   Wait(handle); (**) Lock(mutex) }
  // mailbox must be full now
}

Signaling thread :

Lock mutex
if ( mailbox empty )
{
  mailbox = some stuff;
  Signal(handle); Unlock(mutex);
}

now when the mailbox filler signals the handle, the waiter immediately wakes up and tries to lock the mutex and can't; the filler then unlocks and the waiter can run, and it is gauranteed to have an item.

Note that the line marked {atomically} *must* be atomic, otherwise you have a race at (**) just like before.

And this :

  atomically { Unlock(mutex);   Wait(handle); (**) Lock(mutex) }
is exactly what pthread_cond_wait() is.

Personally rather than introducing a new synchronization data type I would have preferred to just get a function that does unlock_wait_lock(). The "cond_var" in pthreads has nothing to do with a "condition"', it's just a waitable handle (and associated mutex and other wiring) in Windows lingo.

There's one more point worth talking about. In the thread that filled the mailbox, we did :


lock mutex
set state
signal
unlock

That's good because it gaurantees that the receiver gets the state it wants and is race free. It's bad because it causes some unnecessary "thread thrashing".

(thread thrashing is a lot like input latency that I mentioned a while ago; it's something you just want to constantly watch and be vigilant about tracking and removing. Any time a thread wakes up and does nothing and goes back to sleep, you are "thrashing" and just wasting massive amounts of CPU. You want to minimize the number of useless thread wakeups)

The alternative is :


lock mutex
set state
unlock
(**)
signal

now, there's a race a ** where the state can get changed before you signal, so the invariant in the receiver is no longer true.

In most cases, however, this is not actually bad, and this form is actually preferred because of its increased efficiency. You have to change the receiver to a "double-check" type of pattern. Something like :


Waiting thread :

Lock mutex;
if ( mailbox empty )
{
  retry:
  cond_wait(mutex,handle); //  atomically { Unlock(mutex); Wait(handle); Lock(mutex) }
  // mailbox may or may not be full now
  if ( mailbox empty ) goto retry;
  // now do work
}

Signaling thread :

Lock mutex
if ( mailbox empty )
{
  mailbox = some stuff;
  Unlock(mutex);
  // intentional race here
  Signal(handle);
}

In general I believe it's a safer design to treat the signal as meaning "wake up and check this condition" instead of "wake up and this condition is definitely set". Then you engineer to minimize the number of wakeups when the condition is not set.

BTW a better design of the primitive would have allowed the signalling thread to do


atomically { Unlock(mutex); Signal(handle); }

which would be the ideal thing. Unfortunately a normal cond_var is not expressed this way. However, apparently on some modern UNIXes cond_var actually *acts* this way. What you do is signal inside the mutex, but the other thread isn't actually woken up until you unlock the mutex. Unfortunately this is a hidden optimization (they should have just provided unlock_and_signal() as one call) and you can't rely on it if you're cross-platform.

Some links on this topic :
Usenet - Condition variables signal with or without mutex locked
condvars signal with mutex locked or not Lo�c OnStage
A word of caution when juggling pthread_cond_signalpthread_mutex_unlock - comp.programming.threads Google Groups

6/28/2011

06-28-11 - String Extraction

So I wrote a little exe to extract static strings from code. It's very simple.

StringExtractor just scans some dirs of code and looks for a tag which encloses a string with parens. eg. :

    MAKE_STRING( BuildHuff )
it takes all the strings it finds in that way and makes a table of indexes and contents, like :


enum EStringExtractor
{
    eSE_Null = 0,
    eSE_BuildHuff = 1,
    eSE_DecodeOneQ = 2,
    eSE_DecodeOneQ_memcpy = 3,
    eSE_DecodeOneQ_memset = 4,
    eSE_DecodeOneQuantum = 5, ...


const char * c_strings_extracted[] = 
{
    0,
    "BuildHuff",
    "DecodeOneQ",
    "DecodeOneQ_memcpy",
    "DecodeOneQ_memset",
    "DecodeOneQuantum", ...

it outputs this to a generated .C and .H file, which you can then include in your project.

The key then is what MAKE_STRING means. There are various ways to set it up, depending on whether you are replacing an old system that uses char * everywhere or not. Basically you want to make a header that's something like :


#if DO_STRINGS_RAW

#define MAKE_STRING( label )   (string_type)( #label )
#define GET_STRING(  index )   (const char *)( index )

#else

#include "code_gen_strings.h"

#define MAKE_STRING( label )  (string_type)( eSE_ ## label )
#define GET_STRING(  index )  c_strings_extracted[ (int)( index ) ]

#endif

(string_type can either be const char * to replace an old system, or if you're doing this from scratch it's cleaner to make it a typedef).

If DO_STRINGS_RAW is on, you run with the strings in the code as normal. With DO_STRINGS_RAW off, all static strings in the code are replaced with indexes and the table lookup is used.

It's important to me that the code gen doesn't actually touch any of the original source files, it just makes a file on the side (I hate code gen that modifies source because it doesn't play nice with editors); it's also important to me that you can set DO_STRINGS_RAW and build just fine without the code gen (I hate code gen that is a necessary step in the build).

Now, why would you do this? Well, for one thing it's just cleaner to get all static strings in one place so you can see what they are, rather than having them scattered all over. But some real practical benefits :

You can make builds that don't have the string table; eg. for SPU or other low-memory console situations, you can run the string extraction to turn strings into indeces, but then just don't link the table in. Now they can send back indeces to the host and you can do the mapping there.

You can load the string table from a file rather than building it in. This makes it optional and also allows localization etc. (not a great way to do this though).

For final builds, if you are using these strings just for debug info, you can easily get rid of all of them in one place just by #defining MAKE_STRING and GET_STRING to nulls.

Anyhoo, here's the EXE :

stringextractor.zip (84k)

(stringextractor is also the first cblib app that uses my new standardized command line interface; all cblib apps in the future will have a common set of -- options; also almost all cblib apps now take either files or dirs on the command line and if you give them dirs they iterate on contents).

(stringextractor also importantly uses a helper to not change the output file if the contents don't change; this means that it doesn't mess up the modtime of the generated file and cause rebuilds that aren't necessary).

Obviously one disadvantage is you can't have spaces or other non-C-compatible characters in the string. But I guess you could fix this by using printf style codes and do printf when you generate the table.

old rants