8/12/2011

08-12-11 - The standard cinit trick

Sometimes I like to write down standard tricks that I believe are common knowledge but are rarely written down.

Say you have some file that does some "cinit" (C++ class constructors called before main) time work. A common example is like a factory that registers itself at cinit time.

The problem is if nobody directly calls anything in that file, it will get dropped by the linker. That is, if all uses are through the factory or function pointers or something like that, the linker doesn't know it gets called that way and so drops the whole thing out.

The standard solution is to put a reference to the file in its header. Something like this :


Example.cpp :

int example_cpp_force = 0;

AT_STARTUP( work I wanted to do );


Example.h :

extern int example_cpp_force;

AT_STARTUP( example_cpp_force = 1 );

where AT_STARTUP is just a helper that puts the code into a class so that it runs at cinit, it looks like this :

#define AT_STARTUP(some_code)   \
namespace { static struct STRING_JOIN(AtStartup_,__LINE__) { \
STRING_JOIN(AtStartup_,__LINE__)() { some_code; } } STRING_JOIN( NUMBERNAME(AtStartup_) , Instance ); };

Now Example.obj will be kept in the link if any file that includes Example.h is kept in the link.

This works so far as I know, but it's not really ideal (for one thing, if Example.h is included a lot, you get a whole mess of little functions doing example_cpp_force = 1 in your cinit). This is one of those dumb little problems that I wish the C standards people would pay more attention to. What we really want is a way within the code file to say "hey never drop this file from link, it has side effects", which you can do in certain compilers but not portably.

8/11/2011

08-11-11 - Free Internet

I mean "free" in a liberty sense, not a monetary sense.

Recent Seattle Weekly article got me thinking about trying to encrypt and anonymize all my internet access. The whole torrent model is just like fish in a barrel for copyright trolls. You can just hop on the net and get a list of infringers any time you want.

So whatever reason, say you want to be able to work on the net and do as you please without your actions being monitored.

Apparently the major US-based services like FindNot and Anonymizer are not to be trusted (they provide logs to the US government and to subpoenas by the RIAA etc).

Really what you want is something like Tor that takes all your traffic and bounces it around a bunch of other machines and then puts out portions of requests from all over. Currently none of those services seem to be quite ready for prime time; Tor for example kicks you out if you try to do high-bandwidth things like torrents.

Some links :

Web-based DNS Randomness Test DNS-OARC
Tor Project Anonymity Online
SwissVPN - Surf the safer way!
Public IP Swiss VPN - Page 2 - Wilders Security Forums
OneSwarm - Private P2P Data Sharing
I2P - Wikipedia, the free encyclopedia
How To Not Get Sued for File Sharing Electronic Frontier Foundation
Free Anonymous BitTorrent Becomes Reality With BitBlinder TorrentFreak
Chilling Effects Clearinghouse
Anonymous P2P - Wikipedia, the free encyclopedia

In general I'm not sure if dark-nets like Tor can survive. I don't trust the internet providers or the US government to allow you to have that freedom. I suspect that if they ever caught on en masse they would be blocked by the standard extra-judicial mechanisms that they used to shut down online poker and funding WikiLeaks (where the government nicely asks the service provider to block that traffic and the provider complies, even though it's not clear the law is on their side).

The only way to get past that (and into places like china) is to hide encrypted packets inside benign packets. That may be fine for little text messages, but you can never get high bandwidth that way.

8/09/2011

08-09-11 - Threading Links

For reference, some of the links I consulted for the recent postings :

[concurrency-interest] fast semaphore
[C++] Chris M Thomasson - Pastebin.com
[C++] Chris M Thomasson - Pastebin.com -rwmutex eventcount
[C++] Chris M Thomasson - Pastebin.com - wsdequeue
[C++] Chris M Thomasson - Pastebin.com - semaphore and mpmc
[C++] Chris M Thomasson - Pastebin.com - mpsc in relacy
[C++] Chris M Thomasson - Pastebin.com - eventcount from cond_Var
[C++] Chris M Thomasson - Pastebin.com - cond_Var from waitset
[C#] Chris M Thomasson - Pastebin.com - eventcount in C#
yet another win32 condvar implementation - comp.programming.threads Computer Group
yet another (tiny) implementation of condvars - comp.programming.threads Google Groups
Would this work on even one platform - about mutex reordering
Windows NT Keyed Events
Win32 Kernel Experimental WaitLock-Free Fast-Path Event-Count for Windows... anZ2dnUVZ, InterlockedLoadFence, and aPOdnXp1l6
win32 condvar futex - NOT! - 29464
Win32 condition variables redux - Thomasson thread list version
Win32 condition variables redux - comp.programming.threads Google Groups
Usenet - Lock-free queue SPMC + MPMC
Usenet - Condition variables signal with or without mutex locked
Time-Published Queue-Based Spin Locks
Ticket spinlocks [LWN.net]
ThreadSanitizer - data-race-test - ThreadSanitizer is a Valgrind-based detector of data races - Race detection tools and mor
Thin Lock vs. Futex � ��Bartosz Milewski's Programming Cafe
The Inventor of Portable DCI-aka-DCL (using TSD) is... ;-) - comp.programming.threads Google Groups
TEREKHOV - Re win32 conditions sem+counter+event = broadcast_deadlock + spur.wake
TBB Thomasson's MPMC
TBB Thomasson - rwmutex
TBB Thomason aba race
TBB Raf on spinning
TBB eventcount posting Dmitry's code
TBB Download Versions
TBB Dmitry on memory model
Task Scheduling Strategies - Scalable Synchronization Algorithms Google Groups
Subtle difference between C++0x MM and other MMs - seq_cst fence weird
Strong Compare and Exchange
Strategies for Implementing POSIX Condition Variables on Win32
Starvation-free, bounded- ... - Intel� Software Network
spinlocks XXXKSE What to do
Spinlocks and Read-Write Locks
SourceForge.net Repository - [relacy] Index of relacy_1_0rrdinttbb_eventcount
Some notes on lock-free and wait-free algorithms Ross Bencina
So what is a memory model And how to cook it - 1024cores
Sleeping Read-Write Locks
Simple condvar implementation for Win32 - comp.programming.threads Google Groups
Simple condvar implementation for Win32 (second attempt)
SignalObjectAndWait Function (Windows)
sequential consistency � Corensic
search for Thomasson - Pastebin.com
search for Relacy - Pastebin.com
sched_setscheduler
Scalable Synchronization
Scalable Synchronization MCS lock
Scalable Synchronization Algorithms Google Groups
Scalable Queue-Based Spin Locks with Timeout
Relacy Race Detector - 1024cores
really simple portable eventcount... - comp.programming.threads Google Groups
really simple portable eventcount... - 2
really simple portable eventcount... - 1
re WaitForMultipleObjects emulation with pthreads
Re sem_post() and signals
Re Portable eventcount (try 2)
Re Intel x86 memory model question
Re C++ multithreading yet another Win32 condvar implementation
race-condition and sub-optimal performance in lock-free queue ddj code...
Race in TBB - comp.programming.threads Google Groups
QPI Quiescence (David Dice's Weblog)
pthread_yield() vs. pthread_yield_np()
pthread_cond_ implementation questions - comp.programming.threads Google Groups
POSIX Threads (pthreads) for Win32
Porting of Win32 API WaitFor to Solaris Platform
Portable eventcount
Portable eventcount - Scalable Synchronization Algorithms Google Groups
Portable eventcount - comp.programming.threads Google Groups
Portable eventcount (try 2) - comp.programming.threads Google Groups
Parallel Disk IO - 1024cores
Obscure Synchronization Primitives
New implementation of condition variables on win32
my rwmutex algorithm for Linux... - this is good
Mutexes and Condition Variables using Futexes
Multithreading in C++0x part 1 Starting Threads Just Software Solutions - Custom Software Development and Website Developmen
Multithreaded File IO Dr Dobb's Journal
Multi-producermulti-consumer SEH-based queue � Intel Software Network Blogs - Intel� Software Network
MSDN Compound Synchronization Objects
MPMC Unbounded FIFO Queue w 1 CASOperation. No jokes. - comp.programming.threads Computer Group
Memory Consistency Models
low-overhead mpsc queue - Scalable Synchronization Algorithms
Lockless Inc Articles on computer science and optimization.
Lockingunlocking SysV semaphores - comp.unix.programmer Google Groups
Lockfree Algorithms - 1024cores
lock-free read-write locks - comp.programming.threads Google Groups
Lock-free bounded fifo-queue on top of vector - comp.programming.threads Google Groups
Linux x86 ticket spinlock
JSS Petersons
JSS Dekker
Joe Seighs awesome rw-spinlock with a twist; the beauty of eventcounts... - comp.programming.threads Google Groups
joe seigh on eventcount fences
Joe Seigh Fast Semaphore
Joe Duffy's Weblog - keyed events
Implementing a Thread-Safe Queue using Condition Variables (Updated) Just Software Solutions - Custom Software Development a
How to use priority inheritance
High-Performance Synchronization for Shared-Memory Parallel Programs University of Rochester Computer Science
good discussion of work stealing
good discussion of a broken condvar implementation
git.kernel.org - linuxkernelgittorvaldslinux-2.6.gitcommit
GCC-Inline-Assembly-HOWTO
futex(2) - Linux manual page
FlushProcessWriteBuffers Function (Windows)
First Things First - 1024cores
Fine-grained condvareventcount
fast-pathed mutex with eventcount for the slow-path... - comp.programming.threads Google Groups
experimental fast-pathed rw-mutex algorithm... - comp.programming.threads Google Groups
eventcount needs storeload
eventcount example of seq_cst fence problem
Effective Go - The Go Programming Language
duffy page that's down meh
Dr. Dobb's Journal Go Parallel QuickPath Interconnect Rules of the Revolution Dr. Dobb's and Intel Go Parallel Programming
Don�t rely on memory barriers for synchronization� Only if you don�t aware of Relacy Race Detector! � Intel Software Network
dmitry's eventcount for TBB
Distributed Reader-Writer Mutex - 1024cores
Discussion of Culler Singh sections 5.1 - 5.3
Developing Lightweight, Statically Initializable C++ Mutexes Dr Dobb's Journal
Derevyago derslib mt_threadimpl.cpp Source File
Derevyago - C++ multithreading yet another Win32 condvar implementation - comp.programming.threads Google Groups
Dekker's algorithm - Wikipedia, the free encyclopedia
David's Wikiblog
data-race-test - Race detection tools and more - Google Project Hosting
condvars signal with mutex locked or not Lo�c OnStage
Concurrent programming on Windows - Google Books
concurrency-induced memory-access anomalies - comp.std.c Google Groups
CONCURRENCY Synchronization Primitives New To Windows Vista
comp.programming.threads Google Groups
comp.lang.c++ Google Groups - thomasson event uses
Common threads POSIX threads explained, Part 3
Chris M. Thomasson - Pastebin.com
Chris M. Thomasson - Pastebin.com - win_condvar
Chapter�22.�Thread - Boost 1.46.1
cbloom rants 07-18-10 - Mystery - Does the Cell PPU need Memory Control -
cbloom rants 07-18-10 - Mystery - Do Mutexes need More than Acquire-Release -
Causal consistency - Wikipedia, the free encyclopedia
C++1x lock-free algos and blocking - comp.lang.c++ Google Groups
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups
C++0x memory_order_acq_rel vs memory_order_seq_cst
C++ native-win32 waitset class for eventcount... - comp.programming.threads Google Groups
C++ native-win32 waitset class for eventcount... - broken for condvar
C++ N1525 Memory-Order Rationale - nice
C++ multithreading yet another Win32 condvar implementation
Bug-Free Mutexs and CondVars w EventCounts... - comp.programming.threads Google Groups
Break Free of Code Deadlocks in Critical Sections Under Windows
Boost rwmutex 2
Boost rwmutex 1
boost atomics Usage examples - nice
Blog Archive Just Software Solutions - Custom Software Development and Website Development in West Cornwall, UK
Atomic Ptr Plus Project
Asymmetric Dekker
appcoreac_queue_spsc - why eventcount needs fence
AppCore A Portable High-Performance Thread Synchronization Library
Advanced Cell Programming
A word of caution when juggling pthread_cond_signalpthread_mutex_unlock - comp.programming.threads Google Groups
A theoretical question on synchronization - comp.programming.threads Google Groups
A race in LockSupport park() arising from weak memory models (David Dice's Weblog)
A garbage collector for C and C++
A futex overview and update [LWN.net]
A Fair Monitor (Condition Variables) Implementation for Win32

08-09-11 - The Lobster

(this coinage is so obvious I must have stolen it from somewhere, anyway...)

I've been thinking a lot recently about "the lobster".

I've always thought it was bizarre how you can pull into any podunk town in America and go to the scary local diner / steak house, and there will be the regular items - burger, chicken fried steak, what have you, all under $10, and then there's the lobster, for $30, ridiculously overpriced, tucked in the corner of the menu with decorative squiggles around it (as if it needs velvet ropes to separate the VIP section of the menu from the plebian fare).

The thing is, the lobster is not actually good. They probably can't remember the last time anybody actually ordered the lobster. No local would; if the waitress likes you she would warn you not to get, the chefs roll their eyes when the order comes in. Why is it on the menu at all?

I guess it's just there as a trap, for some sucker who doesn't know better, for someone wanting to show off the money they just won, or someone on an expense account to waste money on. You're really just humiliating yourself when you order it, and the restaurant is laughing at you.

I think most people know that you don't actually ever order the lobster in restaurants (other than lobster-specializing places in like Maine or something). But "the lobster" can pop up in many other guises. Expensive watches are obvious lobsters, expensive cars can be less obvious lobsters (is a Maserati a lobster? an Alfa? an Aston? a Porsche?), certainly some of the options and special editions are obvious lobsters, for example the recent Porsche "Speedster" special edition that cost $250k and was just a regular Carrera other than a few colored bits, that's clearly a lobster and Porsche laughs and rolls their eyes at the Seinfelds of the world who are stupid enough to buy the Porsche lobster just because it was on the menu with squiggly lines around it.

I feel like a lot of salesmen try to slip the lobster on you when you're not paying attention. Like when the contractor asks if you want your counters in wood or stone or italian marble - hey wait, contractor, that's the lobster! okay, yeah, you got me, I don't even know where to get italian marble but I thought I'd try to slip it in there. Home improvement in general is full of lobsters. Home theatre stores usually carry a lobster; car wheels ("rims") are rife with lobsters.

The thing that makes the nouveau riche so hilarious is they are constantly getting suckered into buying the lobster and then have the stupidity to brag about it. Ooo look at my gold plated boat ; you fool, you bought the lobster, hide your shame!


One of the things that's so satisfying about video games is that you get a clear reward for more work. You kill some monsters, you get experience, you go up a level; you collect 200 gems, now you can buy the red shield, and it is objectively better than the blue shield you had before. It's very simple and satisfying.

Life is not so clear. More expensive things are not always better. Doing more work doesn't necessarily improve your life. This can be frustrating and confusing.

One of the things that makes me lose it is video game designers who think it's a good idea to make games more realistic in this sense, like providing items in the stores that are expensive but not actually very good. No! I don't want to have to try to suss out "the lobster" in the video game blacksmith, you want video game worlds to be an escapist utopia in which it's always clear that spending more money gets you better stuff. (the other thing I can't stand is games that take away your items; god dammit, don't encourage me to do the work for that if you're going to take it away, don't inject the pains of real life into games, it does not make them better!)

8/01/2011

08-01-11 - Non-mutex priority inversion

An issue I don't see discussed much is non-mutex priority inversion.

First a review of mutex priority inversion. A low priority thread locks a mutex, then loses execution. A high priority thread then tries to lock that mutex and blocks. It gives up its time slice, but a bunch of medium priority threads are available to run, so they take all the time and the low priority thread doesn't get to run. We call it "priority inversion" because the high priority thread is getting CPU time as if it was the same as the low priority thread.

Almost all operating systems have some kind of priority-inversion-protection built into their mutex. The usual mechanism goes something like this : when you block on a mutex, find the thread that currently owns it and either force execution to go to that thread immediately, or boost its priority up to the same priority as the thread trying to get the lock. (for example, Linux has "priority inheritance").

The thing is, there are plenty of other ways to get priority inversion that don't involve a mutex.

The more general scenario is : a high priority thread is waiting on some shared object to be signalled ; a low priority thread will eventually signal that object ; medium priority threads take all the time so the low priority thread can't run, and the high priority thread stays blocked.

For example, this can happen with Semaphores, Events, etc. etc.

The difficulty is that in these cases, unlike with mutexes, the OS doesn't know which thread will eventually signal the shared object to let the high priority thread go, so it doesn't know who to boost.

Windows has panic mechanisms like the "balance set manager" which look for any thread which is not waiting on a waitable handle, but is getting no CPU time, then they force it to get some CPU time. This will save you if you are in one of these non-mutex priority-inversions, but it takes quite a long time for that to kick in, so it's really a last ditch panic save, if it happens you regret it.

Sometimes I see people talking about mutex priority inversion as if that's a scary issue; it's really not on any modern OS. But non-mutex priority inversion *is*.

Conclusion : beware using non-mutex thread flow control primitives on threads that are not of equal priority !

08-01-11 - Double checked wait

Something that we have touched on a few times is the "double checked wait" pattern. It goes like this :
consumer :

if ( not available )
{
    prepare_wait();

    if ( not available )
    {
        wait();
    }
    else
    {
        cancel_wait();
    }
}

producer :

make available
signal_waiters();
now, why do we do this? Well, if you did just a naive check like this :

consumer :

if ( not available )
{
    // (*1)
    wait();
}

producer :

make available
signal_waiters();

you have a race. What happens is you check available and see none, so you step in to *1 ; then the producer runs, publishes whatever and signals - but there are no waiters yet so the signal is lost. Then you go into the wait() and deadlock. This is the "lost wakeup" problem.

So, the double check avoids this race. What must the semantics of prepare_wait & wait be for it to work? It's something like this :

Any signal that happens between "prepare_wait" and "wait" must cause "wait" to not block (either because the waitable handle is signalled, or through some other mechanism).

Some implementations of a prepare_wait/wait mechanism may have spurious signals; eg. wait might not block even though you shouldn't really have gotten a signal; because of that you will usually loop in the consumer.

Now let's look at a few specific solutions to this problem :

condition variables

This is the locking solution to the race. It doesn't use double-checked wait, instead it uses a mutex to protect the race; the naive producer/consumer is replaced with :


consumer :

mutex.lock();
if ( not available )
{
    unlock_wait_lock();
}

producer :

mutex.lock();
make available
signal_waiters();
mutex.unlock();

which prevents the race because you hold the mutex in the consumer across the condition check and the decision to go into the wait.

waitset

Any simple waitset can be used in this scenario with a double-checked wait. For example a trivial waitset based on Event is like this :


waitset.prepare_wait :
    add current thread's Event to list of waiters

waitset.wait :
    WaitForSingleObject(my Event)

waitset.signal_waiters :
    signal all events in list of waiters

for instance, "waitset" could be a vector of handles with a mutex protecting access to that vector. This would be a race without the prepare_wait and double checking.

In this case we ensure the double-checked semantics works because the current thread is actually added to the waitset in prepare_wait. So any signal that happens before we get into wait() will set our Event, and our wait() will not actually block us, because the event is already set.

eventcount

Thomasson's eventcount accomplishes the same thing but in a different way. A simplified version of it works like this :


eventcount.prepare_wait :
    return key = m_count

eventcount.wait :
    if ( key == m_count )
        Wait(event)

eventcount.signal_waiters :
    m_count++;
    signal event;

(note : event is a single shared broadcast event here)

in this case, prepare_wait doesn't actually add you to the waitset, so signals don't go to you, but it still works, because if signal was called in the gap, the count will increase and no longer match your key, so you will not do the wait.

That is, it specifically detects the race - it sees "was there a signal between when I did prepare_wait and wait?" , and if so, it doesn't go into the wait. The consumer should loop, so you keep trying to enter the wait until you get to check your condition without a signal firing.

futex

It just occurred to me yesterday that futex is actually another solution to this exact same problem. You may recall - futex does an internal check of your pointer against a value, and only goes into the wait if the value matches.

producer/consumer with futex is like this :


consumer :

if ( value = not_available )
{
    futex_wait(&value,not_available);
}

producer :

value = available
futex_signal(&value);

this may look like just a single wait at a glance, but if we blow out what futex_wait is doing :

consumer :

if ( value == not_available )
{
    //futex_wait(&value,not_available);

    futex_prepare_wait(&value);
    if ( value == not_available )
        futex_commit_wait(&value);
    else
        futex_cancel_wait(&value);
}

producer :

value = available
futex_signal(&value);

we see can clearly see that futex is just double-checked-wait in disguise.

That is, futex is really our beloved prepare_wait -> wait pattern , but only for the case that the wait condition is of the form *ptr == something.


Do we like the futex API? Not really. I mean it's nice that the OS provides it, but if you are designing your own waitset you would never make the API like that. It confines you to only working on single ints, and your condition has to be int == value. A two-call API like "prepare_wait / wait" is much more flexible, it lets you check conditions like "is this lockfree queue empty" which are impossible to do with futex (what you wind up doing is just doing the double-check yourself and use futex just as an "Event", either that or duplicating the condition into an int for futex's benefit (but that is risky, it can race if not done right, so not recommended)).

BTW some of the later extensions of futex are very cool, like bitset waiting and requeue.

08-01-11 - A game threading model

Some random ideas.

There is no "main" thread at all, just a lot of jobs. (there is a "main job" in a sense, that runs once a frame kicks off the other jobs needed to complete that frame)

Run 1 worker thread per core; all workers just run "jobs", they are all interchangeable. This is a big advantage for many reasons; for example if one worker gets swapped out (or some outside process takes over that CPU), the other workers just take over for it, there is never a stall on a specific thread that is swapped out. You don't have to switch threads just to run some job, you can run it directly on yourself. (caveat : one issue is the lost worker problem which we have mentioned before and needs more attention).

You also need 1 thread per external device that can stall (eg. disk IO, GPU IO). If the API's to these calls were really designed well for threading this would not be necessary - we need a thread per device simply to wrap the bad API's and provide a clean one out to the workers. What makes a clean API? All device IO needs to just be enqueue'd immediately and then provide a handle that you can query for results or completion. Unfortunately real world device IO calls can stall the calling thread for a long time in unpredictable ways, so they are not truly async on almost any platform. These threads should be high priority, do almost no CPU work, and basically just act like interrupts.

A big issue is how you manage locking game objects. I think the simplest thing conceptually is to do the locking at "game object" granularity, that may not be ideal for performance but it's the easiest way for people to get it right.

You clearly want some kind of reader/writer lock because most objects are read many more times than they are written. In the ideal situation, each object only updates itself (it may read other objects but only writes itself), and you have full parallelism. That's not always possible, you have to handle cross-object updates and loops; eg. A writes A and also writes B , B writes B and also writes A ; the case that can cause deadlock in a naive system.

So, all game objects are referenced through a weak-reference opaque handle. To read one you do something like :

    const Object * rdlock(ObjectHandle h)
and then rely on C's const system to try to ensure that people aren't writing to objects they only have read-locked (yes, I know const is not ideal, but if you make it a part of your system and enforce it through coding convention I think this is probably okay).

In implementation rdlock internally increments a ref on that copy of the object so that the version I'm reading sticks around even if a new version is swapped in by wrlock.

There are various ways to implement write-lock. In all cases I make wrlock take a local copy of the object and return you the pointer to that. That way rdlocks can continue without blocking, they just get the old state. (I assume it's okay for reads to get one-frame-old data) (see note *). wrunlock always just exchanges in the local object copy into the table. rdlocks that were already in progress still hold a ref to the old data, but subsequent rdlocks and wrlocks will get the new data.

One idea is like this : Basically semi-transactional. You want to build up a transaction then commit it. Game object update looks something like this :

    Transaction t;
    vector<ObjectHandle> objects_needed;
    objects_needed = self; 
    for(;;)
    {
        wrlock on all objects_needed;

        .. do your update code ..
        .. update code might find it needs to write another object, then do :

        add new_object to objects_needed
        if ( ! try_wrlock( new_object ) )
            continue; // aborts the current update and will restart with new_object in the objects_needed set

        wrunlock all objects locked
        if ( unlocks committed )
            break; // update done
    }

(in actual C++ implementation the "continue" should be a "throw" , and the for(;;) should be try/catch , because the failed lock could happen down inside some other function; also the throw could tell you what lock caused the exception).

There's two sort of variants here that I believe both work, I'm not sure what the tradeoffs are :

1. More mutex like. wrlock is exclusive, only one thread can lock an object at a time. wrunlock at the end of the update always proceeds unconditionally - if you got the locks you know you can just unlock them all, no problem. The issues is deadlock for different lock orders, we handle that with the try_lock, we abort all the locks and go back to the start of the update and retake the locks in a standardized order.

2. More transaction like. wrlock always proceeds without blocking, multiple threads can hold wrlock at the same time. When you wrunlock you check to see that all the objects have the same revision number as when you did the wrlock, and if not then it means some other commit has come in while you were running, so you abort the unlock and retry. So there's no abort/retry at lock time, it's now at unlock time.

In this simplistic approach I believe that #1 is always better. However, #2 could be better if it checked to see if the object was not actually changed (if it's a common case to take a wrlock because you thought you needed it, but then not actually modify the object).

Note that in both cases it helps to separate a game object's mutable portion from its "definition". eg. the things about it that will never change (maybe its mesh, some AI attributes, etc.) should be held to the side somehow and not participate in the wrlock mechanism. This is easy to do if you're willing to accept another pointer chase, harder to do if you want it to just be different portions of the same continuous memory block.

Another issue with this is if the game object update needs to fire off things that are not strictly in the game object transaction system. For example, say it wants to start a Job to do some path finding or something. You can't fire that right away because the transaction might get aborted. So instead you put it in the "Transation t" thing to delay it until the end of your update, and only if your unlocks succeed then the jobs and such can get run.

(* = I believe it's okay to read one frame old data. Note that in a normal sequential game object update loop, where you just do :


for each object
    object->update();

each object is reading a mix of old and new data; if it reads an item in the list before itself, it reads new data, if it reads an item after itself, it reads old data; thus whether it gets old or new data is a "race" anyway, and your game must be okay with that. Any time you absolutely must read the most recent data you can always do a wrlock instead of a rdlock ;

You can also address this in the normal way we do in games, which is separate objects in a few groups and update them in chunks like "phase 1", then "phase2" ,etc. ; objects that are all within the same phase can't rely on their temporal order, but objects in a later phase do know that they see the latest version of the earlier phase. This is the standard way to make sure you don't have one-frame-latency issues.

*).

The big issue with all this is how to ensure that you are writing correct code. The rules are :

1. rdlock returns a const * ; never cast away const

2. game object updates must only mutate data in game objects - they must not mutate global state or anything outside of the limitted transaction system. This is hard to enforce; one way might be to make it absolutely clear with a function name convention which functions are okay to call from inside object updates and which are not.

For checking this, you could set a TLS flag like "in_go_update" when you are in the for {} loop, then functions that you know are not safe in the GO loop can just do ASSERT( ! in_go_update ); which provides a nice bit of safety.

3. anything you want to do in game object update which is not just mutating some GO variables needs to be put into the Transaction buffer so it can be delayed until the commit goes through. Delayed transaction stuff cannot fail; eg. it doesn't get to participate in the retry/abort, so it must not require multiple mutexes that could deadlock. eg. they should pretty much always just be Job creations or destructions that are just pushes/pops from queues.

Another issue that I haven't touched on is the issue of dependencies. A GO update could be dependent on another GO or on a Job completion. You could use the freedom of the scheduling order to reschedule GOs whose dependencies aren't done for later in the tick, rather than stalling.

7/31/2011

07-31-11 - An example that needs seq_cst -

No, not really. I thought I found the great white whale an algorithm that actually needs sequential consistency , but it turned out to be our old friend the StoreLoad problem.

It's worth having a quick look at because some of the issues are ones that pop up often.

I was rolling a user-space futex emulator. To test it I wrote a little mutex. A very simple mutex based on a very simplified futex might look like this :


struct futex_mutex2
{
    std::atomic<int> m_state;

    futex_mutex2() : m_state(0)
    {
    }
    ~futex_mutex2()
    {
    }

    void lock(futex_system * system)
    {
        if ( m_state($).exchange(1,rl::mo_acq_rel) )
        {
            void * h = system->prepare_wait();
            while ( m_state($).exchange(1,rl::mo_acq_rel) )
            {
                system->wait(h);
            }
            system->retire_wait();
        }
    }

    void unlock(futex_system * system)
    {
        m_state($).store(0,rl::mo_release);

        system->notify_one();
    }
};

(note that the actually "futexiness" of it is removed now for simplicity of this test ; also of course you should exchange state to a contended flag and all that, but that hides the problem, so that's removed here).

Then the super-simplified futex system (with all the actual futexiness removed, so that it's just a very simple waitset) is :


//#define MO    mo_seq_cst
#define MO  mo_acq_rel

struct futex_system
{
    HANDLE          m_handle;
    atomic<int>     m_count;

    /*************/
    
    futex_system()
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    
        m_count($).store(0);
    }
    ~futex_system()
    {
        CloseHandle(m_handle);  
    }
        
    void * prepare_wait( )
    {
        m_count($).fetch_add(1,MO);
        
        return (void *) m_handle;
    }

    void wait(void * h)
    {
        WaitForSingleObject((HANDLE)h, INFINITE);
    }

    void retire_wait( )
    {
        m_count($).fetch_add(-1,MO);
    }
    
    void notify_one( )
    {
        if ( m_count($).load(mo_acquire) == 0 ) // mo_seq_cst
            return;
        
        SetEvent(m_handle);
    }
};

So I was finding that it didn't work unless MO was seq_cst (and the load too).

The first point of note is that when I had the full futex system in there which had some internal std::mutexes - there was no bug, the ops on count($) didn't need to be seq cst. That's a common and nasty problem - if you have some ops internally that are seq_cst (such as mutex lock unlock), it can hide the fact that your other atomics are not memory ordered correctly. It was only when I removed the mutexes that the problem revealed itself, but it was actually there all along.

We've discussed this before when we asked "do mutexes need to be seq cst" ; the answer is NO if you just want them to provide mutual exclusion. But if you want them to act like an OS mutex, then the answer is YES. And the issue is that people can write code that is basically relying on the OS mutex being a barrier that provides more than just mutual exclusion.

The next point is that when I reduced the test down to just 2 threads, I still found that I needed seq_cst. That should be a tipoff that the problem does not actually arise from a need for total order. A true seq_cst problem should only show up when you go over 2 threads.

The real problem of course was here :


    void unlock(futex_system * system)
    {
        m_state($).store(0,rl::mo_release);

        //system->notify_one();

        #StoreLoad

        if ( system->m_count($).load(mo_acquire) == 0 ) // mo_seq_cst
            return;
        
        SetEvent(system->m_handle);
    }
};

we just need a StoreLoad barrier there. It should be obvious why we need a StoreLoad there but I'll be very explicit :

same as :

    void unlock(futex_system * system)
    {
        m_state($).store(0,rl::mo_release);

        int count = system->m_count($).load(mo_acquire);

        if ( count == 0 )
            return;
        
        SetEvent(system->m_handle);
    }

same as :

    void unlock(futex_system * system)
    {
        int count = system->m_count($).load(mo_acquire);

        // (*1)

        m_state($).store(0,rl::mo_release);

        if ( count == 0 )
            return;
        
        SetEvent(system->m_handle);
    }

so now at (*1) we have already loaded count and got a 0 (no wiaters); then the other thread trying to lock the mutex sees state == 1, locked, so it incs count and goes to sleep, and we return, and we have a deadlock.

As noted in the first post on this topic, there's no way to express only #StoreLoad in C++0x , so you wind up needing seq cst. Note that the case we cooked up here is almost identical to Thomasson's problem with "event count" so you can read about that :

Synchronization Algorithm Verificator for C++0x - Page 2
really simple portable eventcount... - comp.programming.threads Google Groups
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups
C++0x memory_order_acq_rel vs memory_order_seq_cst
appcoreac_queue_spsc - comp.programming.threads Google Groups

7/30/2011

07-30-11 - A look at some bounded queues - part 2

Okay, let's look into making an MPMC bounded FIFO queue.

We can use basically the same two ideas that we worked up last time.

First let's try to do one based on the read and write indexes being atomic. Consider the consumer; the check for empty now is much more race prone, because there may be another consumer simultaneously reading, which could turn the queue into empty state while you are reading. Thus we need a more single atomic moment to detect "empty" and reserve our read slot.

The most brute-force way to do this kind of thing is always to munge the two variables together. In this case we stick the read & write index into one int together. Now we can atomically check "empty" in one go. We're going to put rdwr in a 32-bit int and use the top and bottom 16 bits for the read index and write index.

So you can reserve a read slot something like this :


    nonatomic<t_element> * read_fetch()
    {
        unsigned int rdwr = m_rdwr($).load(mo_acquire);
        unsigned int rd;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            int wr = rdwr & 0xFFFF;
            
            if ( wr == rd ) // empty
                return false;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,rdwr+(1<<16),mo_acq_rel) )
                break;
        }
                
        nonatomic<t_element> * p = & ( m_array[ rd % t_size ] );

        return p;
    }

but this doesn't work by itself. We have succeeded in atomically checking "empty" and reserving our read slot, but now the read index no longer indicates that the read has completed, it only indicates that a reader reserved that slot. For the writer to be able to write to that slot it needs to know the read has completed, so we need to publish the read through a separate read counter.

The end result is this :


template <typename t_element, size_t t_size>
struct mpmc_boundq_1_alt
{
//private:

    // elements should generally be cache-line-size padded :
    nonatomic<t_element>  m_array[t_size];
    
    // rdwr counts the reads & writes that have started
    atomic<unsigned int>    m_rdwr;
    // "read" and "written" count the number completed
    atomic<unsigned int>    m_read;
    atomic<unsigned int>    m_written;

public:

    mpmc_boundq_1_alt() : m_rdwr(0), m_read(0), m_written(0)
    {
    }

    //-----------------------------------------------------

    nonatomic<t_element> * read_fetch()
    {
        unsigned int rdwr = m_rdwr($).load(mo_acquire);
        unsigned int rd,wr;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            wr = rdwr & 0xFFFF;
            
            if ( wr == rd ) // empty
                return false;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,rdwr+(1<<16),mo_acq_rel) )
                break;
        }
        
        // (*1)
        rl::backoff bo;
        while ( (m_written($).load(mo_acquire) & 0xFFFF) != wr )
        {
            bo.yield($);
        }
        
        nonatomic<t_element> * p = & ( m_array[ rd % t_size ] );

        return p;
    }
    
    void read_consume() 
    {
        m_read($).fetch_add(1,mo_release);
    }

    //-----------------------------------------------------

    nonatomic<t_element> * write_prepare()
    {
        unsigned int rdwr = m_rdwr($).load(mo_acquire);
        unsigned int rd,wr;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            wr = rdwr & 0xFFFF;
            
            if ( wr == ((rd + t_size)&0xFFFF) ) // full
                return NULL;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,(rd<<16) | ((wr+1)&0xFFFF),mo_acq_rel) )
                break;
        }
        
        // (*1)
        rl::backoff bo;
        while ( (m_read($).load(mo_acquire) & 0xFFFF) != rd )
        {
            bo.yield($);
        }
        
        nonatomic<t_element> * p = & ( m_array[ wr % t_size ] );
        
        return p;
    }
    
    void write_publish()
    {
        m_written($).fetch_add(1,mo_release);
    }

    //-----------------------------------------------------
    
    
};

We now have basically two read counters - one is the number of read_fetches and the other is the number of read_consumes (the difference is the number of reads that are currently in progress). Now we have the complication at the spot :

(*1) : after we reserve a read slot - we will be able to read it eventually, but the writer may not yet be done, so we have to wait for him to do his write_publish and let us know which one is done. Furthermore, we don't keep track of which thread is writing this particular slot, so we actually have to wait for all pending writes to be done. (if we just waited for the write count to increment, a later slot might get written first and we would read the wrong thing)

Now, the careful reader might think that the check at (*1) doesn't work. What they think is :

You can't wait for m_written to be == wr , because wr is just the write reservation count that we saw when we grabbed our read slot. After we grab our read slot, several writes might actually complete, which would make m_written > wr ! And we would infinite loop!

But no. In fact, that would be an issue if this was a real functioning asynchronous queue, but it actually isn't. This queue actually runs in lockstep. The reason is that at (*1) in read_fetch, I have already grabbed a read slot, but not done the read. What that means is that no writers can progress because they will see a read in progress that has not completed. So this case where m_written runs past wr can't happen. If a write is in progress, all readers wait until all the writes are done; then once the reads are in progress, any writers trying to get in wait for the reads to get done.

So, this queue sucks. It has an obvious "wait" spin loop, which is always bad. It also is an example of "apparently lockfree" code that actually acts just like a mutex. (in fact, you may have noticed that the code here is almost identical to a ticket lock - that's in fact what it is!).

How do we fix it? Well one obvious problem is when we wait at *1 we really only need to wait on that particular item, instead of all pending ops. So rather than a global read count and written count that we publish to notify that we're done, we should have a flag or a count in the slot, more like our second spsc bounded queue.

So we'll leave the "rdwr" single variable for where the indexes are, and we'll just wait on publication per slot :


template <typename t_element, size_t t_size>
struct mpmc_boundq_2
{
    enum { SEQ_EMPTY = 0x80000 };

    struct slot
    {
        atomic<unsigned int>    seq;
        nonatomic<t_element>    item;
        char pad[ LF_CACHE_LINE_SIZE - sizeof(t_element) - sizeof(unsigned int) ];
    };

    slot m_array[t_size];

    atomic<unsigned int>    m_rdwr;

public:

    mpmc_boundq_2() : m_rdwr(0)
    {
        for(int i=0;i<t_size;i++)
        {
            int next_wr = i& 0xFFFF;
            m_array[i].seq($).store( next_wr | SEQ_EMPTY , mo_seq_cst );
        }
    }

    //-----------------------------------------------------

    bool push( const t_element & T )
    {
        unsigned int rdwr = m_rdwr($).load(mo_relaxed);
        unsigned int rd,wr;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            wr = rdwr & 0xFFFF;
            
            if ( wr == ((rd + t_size)&0xFFFF) ) // full
                return false;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,(rd<<16) | ((wr+1)&0xFFFF),mo_relaxed) )
                break;
        }
        
        slot * p = & ( m_array[ wr % t_size ] );
        
        // wait if reader has not actually finished consuming it yet :
        rl::backoff bo;
        while ( p->seq($).load(mo_acquire) != (SEQ_EMPTY|wr) )
        {
            bo.yield($);
        }
        
        p->item($) = T; 
        
        // this publishes that the write is done :
        p->seq($).store( wr , mo_release );
        
        return true;
    }

    //-----------------------------------------------------
    
    bool pop( t_element * pT )
    {
        unsigned int rdwr = m_rdwr($).load(mo_relaxed);
        unsigned int rd,wr;
        for(;;)
        {
            rd = (rdwr>>16) & 0xFFFF;
            wr = rdwr & 0xFFFF;
            
            if ( wr == rd ) // empty
                return false;
                
            if ( m_rdwr($).compare_exchange_weak(rdwr,rdwr+(1<<16),mo_relaxed) )
                break;
        }
        
        slot * p = & ( m_array[ rd % t_size ] );
        
        rl::backoff bo;
        while ( p->seq($).load(mo_acquire) != rd )
        {
            bo.yield($);
        }
        
        *pT = p->item($);
        
        // publish that the read is done :
        int next_wr = (rd+t_size)& 0xFFFF;
        p->seq($).store( next_wr | SEQ_EMPTY , mo_release );
    
        return true;
    }
    
};

just to confuse things I've changed the API to a more normal push/pop , but this is identical to the first queue except that we now wait on publication per slot.

So, this is a big improvement. In particular we actually get parallelism now, while one write is waiting, another write can go ahead and proceed if the read on its slot is still pending.

"mpmc_boundq_1_alt" suffered from the bad problem that if one reader swapped out during its read, then all writers would be blocked from proceeding (and that would then block all other readers). Now, we no longer have that. If a reader is swapped out, it only blocks the write of that particular slot (and of course blocks if you wrap around the circular buffer).

This is still bad, because you basically have a "wait" on a particular thread and you are just spinning.

Now, if you look at "mpmc_boundq_2" you may notice that the operations on the "rdwr" indexes are actually relaxed memory order - they need to be atomic RMW's, but they actually are not the gate for access and publication - the "seq" variable is now the gate.

This suggests that we could make the read and write indexes separate variables that are only owned by their particular side - like "spsc_boundq2" from the last post , we want to detect the full and empty conditions by using the "seq" variable in the slots, rather than looking across at the reader/writer indexes.

So it's obvious we can do this a lot like spsc_boundq2 ; the reader index is owned only by reader threads; we have to use an atomic RMW because there are now many readers instead of one. Publication and access checking is done only through the slots.

Each slot contains the index of the last access to that slot + a flag for whether the last access was a read or a write :


template <typename t_element, size_t t_size>
struct mpmc_boundq_3
{
    enum { SEQ_EMPTY = 0x80000 };
    enum { COUNTER_MASK = 0xFFFF };

    struct slot
    {
        atomic<unsigned int>    seq;
        nonatomic<t_element>    item;
        char pad[ LF_CACHE_LINE_SIZE - sizeof(t_element) - sizeof(unsigned int) ];
    };

    // elements should generally be cache-line-size padded :
    slot m_array[t_size];
    
    atomic<unsigned int>    m_rd;
    char m_pad[ LF_CACHE_LINE_SIZE ];
    atomic<unsigned int>    m_wr;

public:

    mpmc_boundq_3() : m_rd(0), m_wr(0)
    {
        for(int i=0;i<t_size;i++)
        {
            int next_wr = i & COUNTER_MASK;
            m_array[i].seq($).store( next_wr | SEQ_EMPTY , mo_seq_cst );
        }
    }

    //-----------------------------------------------------

    bool push( const t_element & T )
    {
        unsigned int wr = m_wr($).load(mo_acquire);
        slot * p = & ( m_array[ wr % t_size ] );
        rl::backoff bo;
        for(;;)
        {       
            unsigned int seq = p->seq($).load(mo_acquire);
            
            // if it's flagged empty and the index is right, take this slot :
            if ( seq == (SEQ_EMPTY|wr) )
            {
                // try acquire the slot and advance the write index :
                if ( m_wr($).compare_exchange_weak(wr,(wr+1)& COUNTER_MASK,mo_acq_rel) )
                    break;
            
                // contention, retry
            }
            else
            {
                // (*2)
                return false;
            }
            
            p = & ( m_array[ wr % t_size ] );

            // (*1)
            bo.yield($);
        }

        RL_ASSERT( p->seq($).load(mo_acquire) == (SEQ_EMPTY|wr) );
        
        // do the write :
        p->item($) = T; 
        
        // this publishes it :
        p->seq($).store( wr , mo_release );
        
        return true;
    }

    //-----------------------------------------------------
    
    bool pop( t_element * pT )
    {
        unsigned int rd = m_rd($).load(mo_acquire);
        slot * p = & ( m_array[ rd % t_size ] );
        rl::backoff bo;
        for(;;)
        {       
            unsigned int seq = p->seq($).load(mo_acquire);
            
            if ( seq == rd )
            {
                if ( m_rd($).compare_exchange_weak(rd,(rd+1)& COUNTER_MASK,mo_acq_rel) )
                    break;
            
                // retry
            }
            else
            {
                return false;
            }
            
            p = & ( m_array[ rd % t_size ] );

            bo.yield($);
        }
                            
        RL_ASSERT( p->seq($).load(mo_acquire) == rd );
        
        // do the read :
        *pT = p->item($);
        
        int next_wr = (rd+t_size)& 0xFFFF;
        p->seq($).store( next_wr | SEQ_EMPTY , mo_release );
    
        return true;
    }
    
};

So our cache line contention is pretty good. Only readers pass around the read index; only writers pass around the write index. The gate is on the slot that you have to share anyway. It becomes blocking only when near full or near empty. But all is not roses.

Some notes :

(*1) : the yield loop here might look analogous to before, but in fact you only loop here for RMW contention - that is, this is not a "spin wait" , it's a lockfree-contention-spin.

(*2) : when do we return false here? When the queue is full. But that's not all. We also return false when there is a pending read on this slot. In fact our spin-wait loop still exists and it's just been pushed out to the higher level.

To use this queue you have to do :


while ( ! push(item) ) {
    // wait-spin here
}

which is the spin-wait. The wait on a reader being done with your slot is inherent to these methods and it's still there.

What if we only want to return false when the queue is full, and spin for a busy wait ? We then have to look across at the reader index to check for full vs. read-in-progress. It looks like this :


    bool push( const t_element & T )
    {
        unsigned int wr = m_wr($).load(mo_acquire);
        rl::backoff bo;
        for(;;)
        {
            slot * p = & ( m_array[ wr % t_size ] );
        
            unsigned int seq = p->seq($).load(mo_acquire);
            
            if ( seq == (SEQ_EMPTY|wr) )
            {
                if ( m_wr($).compare_exchange_weak(wr,(wr+1)& COUNTER_MASK,mo_acq_rel) )
                    break;
            
                // retry spin due to RMW contention
            }
            else
            {
                // (*3)
                if ( seq <= wr ) // (todo : doesn't handle wrapping)
                {
                    // full?
                    // (*4)
                    unsigned int rd = m_rd($).load(mo_acquire);
                    if ( wr == ((rd+t_size)&COUNTER_MASK) )
                        return false;
                }
                            
                wr = m_wr($).load(mo_acquire);

                // retry spin due to read in progress
            }
            
            bo.yield($);
        }
        
        slot * p = & ( m_array[ wr % t_size ] );
        
        // wait if reader has not actually finished consuming it yet :
        RL_ASSERT( p->seq($).load(mo_acquire) == (SEQ_EMPTY|wr) );
        
        p->item($) = T; 
        
        // this publishes it :
        p->seq($).store( wr , mo_release );
        
        return true;
    }

which has the two different types of spins. (notes on *3 and *4 in a moment)

What if you try to use this boundedq to make a queue that blocks when it is empty or full?

Obviously you use two semaphores :


template <typename t_element, size_t t_size>
class mpmc_boundq_blocking
{
    mpmc_boundq_3<t_element,t_size> m_queue;

    fastsemaphore   pushsem;
    fastsemaphore   popsem;

public:
    
    mpmc_boundq_blocking() : pushsem(t_size), popsem(0)
    {
    }
    
    void push( const t_element & T )
    {
        pushsem.wait();
    
        bool pushed = queue.push(x);
        RL_ASSERT( pushed );
        
        popsem.post();
    }
    
    void pop( t_element * pT )
    {
        popsem.wait();
        
        bool popped = queue.pop(&x);
        RL_ASSERT( popped );
                
        pushsem.post(); 
    }
    
};

now push() blocks when it's full and pop() blocks when it's empty. The asserts are only correct when we use the modified push/pop that correctly checks for full/empty and spins during contention.

But what's up with the full check? At (*3) we see that the item we are trying to write has a previous write index in it that has not been read. So we must be full already, right? We have semaphores telling us that slots are available or not, why are they not reliable? Why do we need (*4) also?

Because reads can go out of order.


    say the queue was full
    then two reads come in and grab slots, but don't finish their read yet
    then only the second one finishes its read
    it posts the semaphore
    so I think I can write
    I grab a write slot
    but it's not empty yet
    it's actually a later slot that's empty
    and I need to wait for this reader to finish
    
Readers with good memory might remember this issue from our analysis of Thomasson's simple MPMC which was built on two semaphores - and had this exact same problem. If a reader is swapped out, then other readers can post the pushsem, so writers will wake up and try to write, but the first writer can't make progress because its slot is still in use by a swapped out reader.

Note that this queue that we've wound up with is identical to Dmitry's MPMC Bounded Queue .

There are a lot of hidden issues with it that are not at all apparent from Dmitry's page. If you look at his code you might not notice that : 1) enqueue doesn't only return false when full, 2) there is a cas-spin and a wait-spin together in the same loop, 3) while performance is good in the best case, it's arbitrarily bad if a thread can get swapped out.

Despite their popularity, I think all the MPMC bounded queues like this are a bad idea in non-kernel environments ("kernel-like" to me means you have manual control over which threads are running; eg. you can be sure that the thread you are waiting on is running; some game consoles are "kernel-like" BTW).

(in contrast, I think the spsc bounded queue we did last time is quite good; in fact this was a "rule of thumb" that I posted back in the original lockfree series ages ago - whenever possible avoid MPMC structures, prefer SPSC, even multi-plexed SPSC , hell even two-mutex-protected SPSC can be better than MPMC).

7/29/2011

07-29-11 - Spinning

It's important to distinguish the two different reasons for spinning in threaded code; it's unfortunate that we don't have different names for them.

Spinning due to a contended RMW operation is not generally horrible, though this can be a bit subtle.

If you know that the spin can only happen when some other thread made progress, that's the good case. (though you can still livelock if the "progress" that the other thread made is in a loop)

One source of problems is on LL/SC architectures, the lock is usually for the whole cache line. That means somebody else could be fiddling things on your cache line and you might abort your SC without ever making progress. eg. something like :


atomic<int> lock;
int spins; // on same cache line

many threads do :

for(;;)
{
  if ( CAS(x,0,1) ) break;

  spins++;
}

this looks like just a spinlock that counts its spins, but in fact it could be a livelock because the write to "spins" invalidates the cacheline of someone trying to do the CAS, and none of the CAS's ever make progress.

There are also plenty of "lockfree" code snippets posted around the net that are actually more like spin-locks, which we shall discuss next.

The other type of spinning is waiting for some other thread to set some condition.

I believe that this type of spinning basically has no place in games. It really only be used in kernel environments where you can ensure that the thread you are waiting on is current running. If you can't do that, you might be spinning on a thread which is swapped out, which could be a very long spin indeed.

These can come up in slightly subtle places. One example we've seen is this MPMC queue ; the spin-backoff there is actually a wait on a specific, and is thus very bad.

Any time you are actually spinning waiting on a specific thread to do something, you need to actually use an OS Wait on that thread to make sure it gets CPU time.

Relacy's context-bound test will tell you any time you have a spin without a backoff.yield call , so that's nice. Then you have to look at any place you do a backoff.yield and try to see if it's waiting for a thread to set a condition, or only spinning if another thread made progress during that time.

Interestingly a simple spinlock illustrates both types of spins :


for(;;)
{
    if ( lock == 1 )
    {
        // spin wait, bad
        continue;
    }

    if ( ! cas(lock,0,1) )
    {
        // spin contention, okay
        continue;
    }

    break;
}

Sometimes "spin waits" can be well hidden in lockfree code. The question you should ask yourself is :

If one of the threads is swapped out (stops running), is my progress blocked?

If that is true, it's a spin wait, and that's bad.

To be clear, whenever what you are really doing is a "wait" - that is, I need this specific other thread to make progress before I can - you really want to explicitly enforce that; either by using an OS wait, or by knowing what thread it is you're waiting on, etc.

A related topic is that "exponential backoff" type stuff in spins is over-rated. The idea of backoff is that you spin and something like :


first 10 times - mm_pause()
next 5 times - SwitchToThread()
then Sleep(1), then Sleep(2) , etc..

this is sort of okay, because if the other thread is running you act like a spin and make progress quickly, while if the other thread is asleep you will eventually go to sleep and give it some CPU time.

But in reality this is shit. The problem is that if you *ever* actually hit the Sleep(1) , that's a disaster. And of course you could have to sleep much longer. Say for example you have 10 threads and you are waiting on a specific one (without using a proper OS wait). When you Sleep, you might not get the thread you want to run; you might have to sleep all your other threads too to finally get it to go.

It's just sloppy threading programming and the performance is too unpredictable for games.

old rants