11-28-11 - Some lock-free rambling

It helps me a lot to write this stuff down, so here we go.

I continually find that #StoreLoad scenarios are confusing and catch me out. Acquire (#LoadLoad) and Release (#StoreStore) are very intuitive, but #StoreLoad is not. I think I've covered almost this exact situation before, but this stuff is difficult so it's worth revisiting many times. (I find low level threading to be cognitively a lot like quantum mechanics, in that if you do it a lot you become totally comfortable with it, but if you stop doing it even for a month it is super confusing and bizarre when you come back to it, and you have to re-work through all the basics to convince yourself they are true.)

(Aside : fucking Google verbatim won't even search for "#StoreLoad" right. Anybody know a web search that is actually verbatim? A whole-word-only option would be nice too, and also a match case option. You know, like basic text search options from like 1970 or so).

The classic case for needing #StoreLoad is WFMO (WaitForMultipleObjects). The very simple scenario goes like this :

bool done1 = false;
bool done2 = false;

// I want to do X() when done1 & done2 are both set.

Thread1:

done1 = true;
if ( done1 && done2 )
    X();

Thread2:

done2 = true;
if ( done1 && done2 )
    X();

This doesn't work.

Obviously Thread1 and Thread2 can run in either order, so done1 and done2 can become set in either order. You would still expect one thread or the other to see them both set. That is not guaranteed; the reason is that memory visibility can be reordered. This is a pretty clear illustration of the thing that trips up many people : threads can interleave both in execution order and in memory visibility order.

In particular the bad execution case goes like this :

done1 = false, done2 = false

T1 sets done1 = true
  T1 sees done1 = true (of course)
  T2 still sees done1 = false (store is not yet visible to him)

T2 sets done2 = true
  T2 sees done2 = true
  T1 still sees done2 = false

T1 checks done2 for (done1 && done2)
  still sees done2 = false
  doesn't call X()

T2 checks done1
  still sees done1 = false
  doesn't call X()

T1 sees done2=true
T2 sees done1=true

when you write it out it's obvious that the issue is the store visibility is not forced to occur before the load. So you can fix it with :


Thread1:

done1 = true;
#StoreLoad
if ( done1 && done2 )
    X();

(and the same in Thread2, with done2)

As noted previously there is no nice way to make a StoreLoad barrier in C++0x. The best method I've found is to make the loads into fetch_add(0,acq_rel) ; that works by making the loads also be stores and using a #StoreStore barrier to get store ordering. (UPDATE : using atomic_thread_fence(seq_cst) also works).
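To make the fence variant from the UPDATE concrete, here is a minimal sketch in C++11. The names (thread1, thread2, x_calls, run_trial) are made up for illustration, and X() is modeled as a counter increment. Each thread does its store, then a seq_cst fence, then its loads; with both fences in place at least one thread should see both flags set.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration; X() is modeled as a counter bump.
static std::atomic<bool> done1{false};
static std::atomic<bool> done2{false};
static std::atomic<int>  x_calls{0};

static void thread1()
{
    done1.store(true, std::memory_order_relaxed);
    // full fence : plays the role of #StoreLoad between the store and the loads
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if ( done1.load(std::memory_order_relaxed) &&
         done2.load(std::memory_order_relaxed) )
        x_calls.fetch_add(1, std::memory_order_relaxed); // X()
}

static void thread2()
{
    done2.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if ( done1.load(std::memory_order_relaxed) &&
         done2.load(std::memory_order_relaxed) )
        x_calls.fetch_add(1, std::memory_order_relaxed); // X()
}

// Runs one trial and returns how many threads called X().
int run_trial()
{
    done1 = false;
    done2 = false;
    x_calls = 0;
    std::thread t1(thread1);
    std::thread t2(thread2);
    t1.join();
    t2.join();
    return x_calls.load();
}
```

Note that the result can be 2 as well as 1 : the fence guarantees that somebody sees both flags, not that only one thread does. If X() must run exactly once you need something extra (an atomic exchange on a "claimed" flag, say) to pick the winner.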

The classic simple waitset that we have discussed previously is a bit difficult to use in more complex ways.

Refresher : A waitset works with a double-check pattern, like :

signalling thread :

set condition
waitset.notify();

waiting thread :

if ( ! condition )
{
    waitset.prepare_wait();

    // double check :
    if ( condition )
        waitset.cancel_wait();
    else
        waitset.wait();
}

we've seen in the past how you can easily build a condition var or an eventcount from waitset. In some sense waitset is a very low level primitive and handy for building higher level primitives from. Now on to new material.
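For anyone who wants to run the double-check pattern, here is a rough sketch. The toy_waitset below is NOT the waitset from the earlier posts; it is a stand-in I am assuming for illustration, built from a mutex, a condition variable and an epoch counter, and it only has a broadcast notify. The helper names (wait_for, set_and_notify, run_waitset_demo) are likewise made up.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy stand-in (an assumption, not the real waitset) : a mutex, a condvar
// and an epoch counter that every notify bumps.
class toy_waitset
{
    std::mutex m_;
    std::condition_variable cv_;
    unsigned epoch_ = 0;
public:
    // returns a key; any notify after this point invalidates it
    unsigned prepare_wait()
    {
        std::lock_guard<std::mutex> lk(m_);
        return epoch_;
    }
    void cancel_wait(unsigned /*key*/) { } // nothing to undo in this toy
    void wait(unsigned key)
    {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this,key]{ return epoch_ != key; });
    }
    void notify_all()
    {
        { std::lock_guard<std::mutex> lk(m_); ++epoch_; }
        cv_.notify_all();
    }
};

// waiting side
void wait_for(std::atomic<bool> & condition, toy_waitset & ws)
{
    while ( ! condition.load() )
    {
        unsigned key = ws.prepare_wait();
        // double check :
        if ( condition.load() )
        {
            ws.cancel_wait(key);
            return;
        }
        ws.wait(key);
    }
}

// signalling side
void set_and_notify(std::atomic<bool> & condition, toy_waitset & ws)
{
    condition.store(true);
    ws.notify_all();
}

// One waiter, one signaller; returns once the waiter has come back.
bool run_waitset_demo()
{
    std::atomic<bool> condition{false};
    toy_waitset ws;
    std::thread waiter([&]{ wait_for(condition, ws); });
    set_and_notify(condition, ws);
    waiter.join();
    return condition.load();
}
```

The double check is what closes the window : if the condition is set before prepare_wait, the second check sees it; if it is set after, the notify bumps the epoch past the key and wait() returns.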

You can easily use waitset to perform an "OR" WFMO. You simply add yourself to multiple waitsets. (you need a certain type of waitset for this which lets you pass in the primitive that you want to use for waiting). To do this we slightly extend the waitset API. The cleanest way is something like this :

instead of prepare_wait :

waiter create_waiter();
void add_waiter( waiter & w );

instead of wait/cancel_wait :

~waiter() does cancel/retire wait
waiter.wait() does the wait

Then an OR wait is something like this :

signal thread 1 :

set condition1
waitset1.notify();

signal thread 2 :

set condition2
waitset2.notify();

waiting thread :

if ( condition1 ) // don't wait

waiter w = waitset1.create_waiter();

// double check condition1 and first check condition2 :

if ( condition1 || condition2 ) // don't wait
  // ~w will take you out of waitset1

waitset2.add_waiter(w);

// double check :

if ( condition2 ) // don't wait
  // ~w will take you out of both waitsets

// I'm now in both waitset1 and waitset2

w.wait();

Okay. This works fine. But there is a limitation which might not be entirely obvious.
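Here is a runnable sketch of that OR wait, using broadcast so the limitation discussed next does not come into play. Everything in it is a stand-in I am assuming for illustration : wake_flag, toy_waitset, toy_waiter and its add_to() method are hypothetical and only loosely mirror the create_waiter / add_waiter API above.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// The primitive a waiter actually sleeps on (hypothetical stand-in).
struct wake_flag
{
    std::mutex m;
    std::condition_variable cv;
    bool signaled = false;
};

class toy_waitset
{
    std::mutex m_;
    std::vector<std::shared_ptr<wake_flag>> list_;
public:
    void add(const std::shared_ptr<wake_flag> & f)
    {
        std::lock_guard<std::mutex> lk(m_);
        list_.push_back(f);
    }
    void remove(const std::shared_ptr<wake_flag> & f)
    {
        std::lock_guard<std::mutex> lk(m_);
        list_.erase(std::remove(list_.begin(), list_.end(), f), list_.end());
    }
    void notify_all() // broadcast
    {
        std::vector<std::shared_ptr<wake_flag>> woken;
        {
            std::lock_guard<std::mutex> lk(m_);
            woken.swap(list_);
        }
        for ( auto & f : woken )
        {
            { std::lock_guard<std::mutex> lk(f->m); f->signaled = true; }
            f->cv.notify_one();
        }
    }
};

// One waiter can sit in several waitsets; the destructor leaves all of them.
class toy_waiter
{
    std::shared_ptr<wake_flag> flag_ = std::make_shared<wake_flag>();
    std::vector<toy_waitset*> sets_;
public:
    ~toy_waiter() { for ( toy_waitset * ws : sets_ ) ws->remove(flag_); }
    void add_to(toy_waitset & ws) { ws.add(flag_); sets_.push_back(&ws); }
    void wait()
    {
        std::unique_lock<std::mutex> lk(flag_->m);
        flag_->cv.wait(lk, [this]{ return flag_->signaled; });
    }
};

void wait_for_either(std::atomic<bool> & c1, toy_waitset & ws1,
                     std::atomic<bool> & c2, toy_waitset & ws2)
{
    if ( c1.load() ) return;              // don't wait
    toy_waiter w;
    w.add_to(ws1);
    if ( c1.load() || c2.load() ) return; // ~w takes us out of ws1
    w.add_to(ws2);
    if ( c2.load() ) return;              // double check; ~w leaves both
    w.wait();                             // in both waitsets now
}

// Sets condition 1 or 2 while another thread does the OR wait.
bool run_or_demo(bool use_second)
{
    std::atomic<bool> c1{false}, c2{false};
    toy_waitset ws1, ws2;
    std::thread waiter([&]{ wait_for_either(c1, ws1, c2, ws2); });
    if ( use_second ) { c2.store(true); ws2.notify_all(); }
    else              { c1.store(true); ws1.notify_all(); }
    waiter.join();
    return c1.load() || c2.load();
}
```

The wake_flag is held by shared_ptr so that a notifier that has already pulled it off the list can still touch it safely after the waiter has returned and been destroyed.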

I have intentionally not made it clear if the notify() in the signalling threads is a notify_one (signal) or notify_all (broadcast). Say you want it to be just notify_one , because you don't want to wake more threads than you need to. Say you have this scenario :

X = false;
Y = false;

Thread1 :

X = true;
waitsetX.notify_one();

Thread2 :

Y = true;
waitsetY.notify_one();

Thread3 :

wait for X || Y

Thread4 :

wait for X || Y

this is a deadlock. The problem is that both of the waiter threads can go to sleep, but the two notifies might both go to the same thread.

This is a generally difficult problem with waitset and is why you generally have to use broadcast (for example eventcount is built on waitset broadcasting).

You may think this is an anomaly of trying to abuse waitset to do an OR, but it's quite common. For example you might try to do something seemingly simple like build semaphore from waitset.

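class semaphore_from_waitset
{
    waitset_simple m_waitset;
    std::atomic<int> m_count;

public:
    semaphore_from_waitset(int count = 0)
    :   m_waitset(), m_count(count)
    {
        RL_ASSERT(count >= 0);
    }

    void post()
    {
        m_count($).fetch_add(1,mo_acq_rel);
        // broadcast or signal :
        // (*1)
        m_waitset.notify_one(); // or notify_all()
    }

    bool try_wait()
    {
        // see if we can dec count before preparing the wait
        int c = m_count($).load(mo_acquire);
        while ( c > 0 )
        {
            if ( m_count($).compare_exchange_weak(c,c-1,mo_acq_rel) )
                return true;
            // c was reloaded :
            //  on failure compare_exchange_weak writes the current value of m_count into c
        }
        return false;
    }

    void wait(HANDLE h)
    {
        for(;;)
        {
            if ( try_wait() )
                return;

            // no count available, get ready to wait
            m_waitset.prepare_wait(h);

            // double check :
            if ( try_wait() )
            {
                m_waitset.cancel_wait(h);
                // (*2)
                // pass on the notify :
                m_waitset.notify_one();
                return;
            }

            m_waitset.wait(h);
            // loop and try again
        }
    }
};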

it's totally straightforward in the waitset pattern, except for the broadcast issue. If *1 is just a notify_one, then at *2 you must pass on the notify. Alternatively if you don't have the re-signal at *2 then the notify at *1 must be a broadcast (notify_all).
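A side note on the "c was reloaded" comment in try_wait, since it is easy to miss : when a compare_exchange fails, it writes the value it actually found into its first argument, so the loop never has to call load() again. Here is a small standalone sketch of just that piece; try_dec and stale_expected_after_cas are names made up for this example.

```cpp
#include <atomic>
#include <cassert>

// Try to take one unit from 'count'; returns false if count is zero.
bool try_dec(std::atomic<int> & count)
{
    int c = count.load(std::memory_order_acquire);
    while ( c > 0 )
    {
        // on failure (another thread changed count, or a spurious failure)
        // compare_exchange_weak writes the value it found into c
        if ( count.compare_exchange_weak(c, c-1, std::memory_order_acq_rel) )
            return true;
    }
    return false;
}

// Shows the reload directly : 'expected' is stale, so the exchange fails
// and 'expected' is overwritten with the real value.
int stale_expected_after_cas()
{
    std::atomic<int> count{5};
    int expected = 3;
    bool ok = count.compare_exchange_strong(expected, 2);
    return ok ? -1 : expected;
}
```

The second function uses compare_exchange_strong only so that the failure is deterministic; the reload-on-failure behavior is the same for the weak form.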

Now obviously if you have 10 threads waiting on a semaphore and you inc the count by 1, you don't want all 10 threads to wake up so that just 1 of them can dec the count and get to execute. The re-signal method will wake 2 threads, so it's better than broadcast, but still not awesome.

(note that this is easy to fix if you just put a mutex around the whole thing; or you can implement semaphore without waitset; the point is not to reimplement semaphore in a bone-headed way, the point is just that even very simple uses of waitset can break if you use notify_one instead of notify_all).
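For experimenting with the re-signal variant, here is a self-contained sketch. It does not use waitset_simple; toy_waitset below is a mutex-and-condvar stand-in that I am assuming for illustration, with an explicit waiter list so that notify_one can land on a thread that is in the list but not yet asleep. toy_waiter, toy_semaphore and run_semaphore_demo are made-up names too.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct toy_waiter { bool signaled = false; };

// Toy stand-in for a waitset (an assumption, not the real one) :
// a mutex-protected waiter list and one shared condition variable.
class toy_waitset
{
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<toy_waiter*> list_;
public:
    void prepare_wait(toy_waiter* w)
    {
        std::lock_guard<std::mutex> lk(m_);
        list_.push_back(w);
    }
    void cancel_wait(toy_waiter* w)
    {
        std::lock_guard<std::mutex> lk(m_);
        list_.erase(std::remove(list_.begin(), list_.end(), w), list_.end());
    }
    void wait(toy_waiter* w)
    {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [w]{ return w->signaled; });
    }
    // picks one waiter and removes it from the list in the same critical section
    void notify_one()
    {
        {
            std::lock_guard<std::mutex> lk(m_);
            if ( list_.empty() ) return;
            list_.front()->signaled = true;
            list_.erase(list_.begin());
        }
        cv_.notify_all(); // shared condvar : sleepers recheck their own flag
    }
};

class toy_semaphore
{
    toy_waitset ws_;
    std::atomic<int> count_;
public:
    explicit toy_semaphore(int count = 0) : count_(count) {}
    int count() const { return count_.load(); }
    void post()
    {
        count_.fetch_add(1, std::memory_order_acq_rel);
        ws_.notify_one(); // (*1) signal, not broadcast
    }
    bool try_wait()
    {
        int c = count_.load(std::memory_order_acquire);
        while ( c > 0 )
            if ( count_.compare_exchange_weak(c, c-1, std::memory_order_acq_rel) )
                return true;
        return false;
    }
    void wait()
    {
        for(;;)
        {
            if ( try_wait() ) return;
            toy_waiter w;
            ws_.prepare_wait(&w);
            if ( try_wait() ) // double check
            {
                ws_.cancel_wait(&w);
                ws_.notify_one(); // (*2) pass on a notify we may have absorbed
                return;
            }
            ws_.wait(&w);
        }
    }
};

// n waiters and n posts : every waiter should get through. Returns the final count.
int run_semaphore_demo(int n)
{
    toy_semaphore sem;
    std::vector<std::thread> waiters;
    for ( int i = 0; i < n; ++i )
        waiters.emplace_back([&sem]{ sem.wait(); });
    for ( int i = 0; i < n; ++i )
        sem.post();
    for ( auto & t : waiters )
        t.join();
    return sem.count();
}
```

If you comment out the notify_one at (*2), run_semaphore_demo can hang in the way described in this post : a post's notify goes to a thread that already took a count in its double check, and a sleeping thread never hears about the count that is left.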

BTW the failure case for semaphore_from_waitset with only a notify_one and no resignal (eg. if you get the (*1) and (*2) points wrong) goes like this :

    T1 : sem.post , sem.post
    T2&T3 : sem.wait

    execution like this :

    T2&3 both check count and see zero
    T1 now does one inc and notify, no one to notify yet
    T2&3 do prepare_wait
    T2 does its double-check, sees a count and takes it (does not retire yet)
    T3 does its double-check, sees zero, and goes to sleep
    T1 now does the next inc and notify
    -> this is the key problem
    T2 can get the notify because it is still in the waiter list
        (not retired yet)
    but T3 needs the notify

The key point is this spot :

            // double check :
            if ( try_wait() )
                // !! key !!

you have passed the double check and are not going to wait, but you are still in the waiter list. This means you can be the one thread chosen to receive the signal, but you don't need it. This is why resignal works.


won3d said...


Google should do a better job with symbols.

ryg said...

Don't see the deadlock in the first "notify_one" example. Yes, both notifies might go to say Thread 3, so Thread 4 is not woken up, but how is that a deadlock?

I guess the general rule is that if you have N waiters on a waitset and call notify_one N times, you'll wake up <=N threads, not exactly N threads. More precisely, it gives you "atomic(notify_one_waiter) then atomic(remove_waiter)", not "atomic(notify_one_waiter_then_remove)", which would be a much nicer (and more intuitive) primitive. Correct?

cbloom said...

"Don't see the deadlock in the first "notify_one" example. Yes, both notifies might go to say Thread 3, so Thread 4 is not woken up, but how is that a deadlock?"

Eh, the most general definition of deadlock, which I use, is that a thread goes into a permanent wait when you expected it to be woken by something.

(I know that's not standard; the book definition of deadlock specifically refers to mutexes, but really any thread wait event is isomorphic to a mutex)

"More precisely, it gives you "atomic(notify_one_waiter) then atomic(remove_waiter)", not "atomic(notify_one_waiter_then_remove)", which would be a much nicer (and more intuitive) primitive. Correct?"

No, not exactly. My current implementation of notify_one actually does atomically remove the waiter. Which is what can mislead you into thinking that it should work.

The problem in the OR case is that because you have two different waitsets, you notify in waitsetX (which also removes) but that doesn't remove from waitsetY, so then the notify for Y might go to the one that already got the notify for X. If the notify could atomically remove from both waitsets, that might fix that case. (you can fix this by adding a shared variable to the "waiter" class that tracks whether it has received a notify or not, and don't notify guys who have already been notified by yourself or someone else).

In the semaphore case the problem is the time that you are in the waitset but not yet in a wait, when you might abort the wait due to the double-check.

That "limbo" period is crucial to making waitset work (this was discussed in detail in past posts) but it's also what causes the problem for when you want to be sure to wake the minimum number of threads.

Obviously in practice you would just put a mutex around the whole thing, which basically turns your waitset into a condition var and removes the limbo period.

If you like, this is a definition of why eventcount needs to be broadcast.

IvyMike said...

On google, the "Verbatim" search (expand search tools, on the left) does OK.

It doesn't keep the "#" sign, but it does at least keep "StoreLoad" together, and the results seem a lot better.

cbloom said...

The point of that aside is that I specifically want the #. The reason I put the # in my search is because I wanted the #. If I didn't want the # I wouldn't have put it in my search.

Drawknob said...

// c was reloaded

Can you elaborate? How was it reloaded?
