04-10-13 - Waitset Resignal Notes

I've been redoing my low-level waitset and want to make some notes. Some previous discussion of the same issues here :

cbloom rants 11-28-11 - Some lock-free rambling
cbloom rants 11-30-11 - Some more Waitset notes
cbloom rants 12-08-11 - Some Semaphores

In particular, two big things occurred to me :

1. I talked before about the "passing on the signal" issue. See the above posts for more in-depth details, but in brief : the issue arises if you are trying to do NotifyOne (instead of NotifyAll) and you have a double-check waitset like this :


{
waiter = waitset.prepare_wait(condition); // register as a waiter on condition

if ( double check )
{
    // condition became true while we were registering - no need to wait
    waiter.cancel();
}
else
{
    waiter.wait();
    // possibly loop and re-check condition
}

}

then if you get a signal between prepare_wait and cancel, you didn't need that signal, so a wakeup of another thread that did need that signal can be missed.

Now, I talked about this before as an "ugly hack", but having thought about it over time, it doesn't seem so bad. In particular, if you put the resignal inside the cancel(), so that the client code looks just like the above, the client doesn't need to know that the resignal mechanism is happening at all.

So, the new concept is that cancel atomically removes the waiter from the waitset and sees if it got a signal that it didn't consume. If so, it just passes on that signal. The fact that this is okay and not a hack came to me when I thought about under what conditions this actually happens. If you recall from the earlier posts, the need for resignal comes from situations like :


T0 posts sem , and signals no one
T1 posts sem , and signals T3
T2 tries to dec count and sees none, goes into wait()
T3 tries to dec count and gets one, goes into cancel(), but also got the signal - must resignal T2

The thing is, this can only happen if all the threads are awake and racing against each other (it requires a very specific interleaving); in particular, the T3 that decs the count and does the resignal had to be awake anyway (its first check saw no count, but its double check did dec the count, so it must have raced with the sem post). It's not like you wake up a thread you shouldn't have and then pass the signal on. The thread wakeup scheme is just changed from :

T0 sem.post --wakes--> T2 sem.wait
T1 sem.post --wakes--> T3 sem.wait

to :

T0 sem.post
T1 sem.post --wakes--> T3 sem.wait --wakes--> T2 sem.wait

that is, one of the consumer threads wakes its peer. This is a tiny performance loss, but it's a pretty rare race, so really not a bad thing.

The whole "double check" pathway in waitset only happens in a race case. It occurs when one thread sets the condition you want right at the same time that you check it, so your first check fails and after you prepare_wait, your second check passes. The resignal only occurs if you are in that race path, and also the setting thread sent you a signal between your prepare_wait and cancel, *and* there's another thread waiting on that same signal that should have gotten it. Basically this case is quite rare, we don't care too much about it being fast or elegant (as long as it's not disastrously slow), we just need behavior to be correct when it does happen - and the "pass on the signal" mechanism gives you that.

The advantage of being able to do just a NotifyOne instead of a NotifyAll is so huge that it's worth adopting this as standard practice in waitset.
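
To make the idea concrete, here's a minimal sketch of what a resignalling cancel could look like on a simple lock-based waitset. The names (m_lock, m_list, signalled, and so on) are assumptions for illustration, not the actual implementation :

#include <mutex>
#include <list>

struct Waiter
{
    void * condition = nullptr;  // what this waiter registered on
    bool   signalled = false;    // set by Notify while holding the waitlist lock
    // ... plus the per-thread OS event that wait() actually sleeps on
};

struct WaitSet
{
    std::mutex          m_lock;  // the waitlist lock
    std::list<Waiter *> m_list;  // registered waiters

    void NotifyOne(void * condition);  // wakes one waiter registered on condition

    void Cancel(Waiter * w)
    {
        bool unconsumed;
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_list.remove(w);            // atomically leave the waitset
            unconsumed = w->signalled;   // did a Notify pick us in the meantime ?
            w->signalled = false;
        }
        if ( unconsumed )
        {
            // we soaked up a signal we didn't need (our double check succeeded),
            // so pass it on to a peer waiting on the same condition
            NotifyOne(w->condition);
        }
    }
};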

2. It then occurred to me that the waitset PrepareWait and Cancel could be made lock-free pretty trivially.

Conceptually, they are made lock-free by turning them into messages. "Notify" is now the receiver of messages and the scheme is now :


{
waiter w;
waitset : send message { prepare_wait , &w, condition };

if ( double check )
{
    waitset : send message { cancel , &w };
    return;
}

w.wait();
}

-------

{
waitset Notify(condition) :
first consume all messages
do prepare_wait and cancel actions
then do the normal notify
eg. see if there are any waiters that want to know about "condition"
}

The result is that the entire wait-side operation is lock free. The notify-side still uses a lock to ensure the consistency of the wait list.
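
As a sketch of what "send message" can mean : the wait side can push onto a lock-free intrusive stack, and the Notify side grabs the whole stack at once, under its existing lock, before scanning the wait list. This is just one plausible way to do it, with hypothetical add_to_waitlist / remove_from_waitlist helpers standing in for the real list maintenance :

#include <atomic>

struct Waiter;

struct Message
{
    enum Op { PrepareWait, Cancel } op;
    Waiter *  waiter;
    void *    condition;   // only meaningful for PrepareWait
    Message * next;
};

std::atomic<Message *> g_messages { nullptr };

// wait side : lock-free push (any thread, any time)
void SendMessage(Message * m)
{
    Message * head = g_messages.load(std::memory_order_relaxed);
    do {
        m->next = head;
    } while ( ! g_messages.compare_exchange_weak(head, m,
                  std::memory_order_release, std::memory_order_relaxed) );
}

void add_to_waitlist(Waiter * w, void * condition);   // hypothetical waitlist ops,
void remove_from_waitlist(Waiter * w);                // done under the waitlist lock

// notify side : called at the top of Notify, while holding the waitlist lock
void ConsumeMessages()
{
    // grab the whole stack in one shot
    Message * m = g_messages.exchange(nullptr, std::memory_order_acquire);

    // the stack is LIFO; reverse it so that a thread's prepare_wait message
    // is processed before its later cancel message
    Message * rev = nullptr;
    while ( m ) { Message * next = m->next; m->next = rev; rev = m; m = next; }

    for ( m = rev ; m ; m = m->next )
    {
        if ( m->op == Message::PrepareWait )
            add_to_waitlist(m->waiter, m->condition);
        else
            remove_from_waitlist(m->waiter);
    }
}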

This greatly reduces contention in the most common usage patterns :


Mutex :

only the mutex owner does Notify
 - so contention of the waitset lock is non-existent
many threads may try to lock a mutex
 - they do not have any waitset-lock contention

Semaphore :

the common case of one producer and many consumers (lots of threads do wait() )
 - zero contention of the waitset lock

the less common case of many producers and few consumers is slow

Another way to look at it : instead of doing little bits of waitlist maintenance in three different places (prepare_wait, notify, cancel), each of which has to take a lock on the list, all the maintenance is moved to one spot.

Now there are some subtleties.

If you used a fresh "waiter" every time, things would be simple. But for efficiency you don't want to do that. In fact I use one unique waiter per thread. There's only one OS waitable handle needed per thread and you can use that to implement every threading primitive. But now you have to be able to recycle the waiter. Note that you don't have to worry about other threads using your waiter; the waiter is per-thread so you just have to worry about when you come around and use it again yourself.
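
A per-thread waiter along those lines might look roughly like this - a sketch using standard C++ in place of the single OS waitable handle, with a signalled flag so the same object can be consumed and reused :

#include <condition_variable>
#include <mutex>

struct Waiter
{
    std::mutex              m;
    std::condition_variable cv;
    bool                    signalled = false;

    void Wait()
    {
        std::unique_lock<std::mutex> lock(m);
        while ( ! signalled )
            cv.wait(lock);
        signalled = false;   // consume the signal so the waiter can be recycled
    }

    void Signal()
    {
        { std::lock_guard<std::mutex> lock(m); signalled = true; }
        cv.notify_one();
    }
};

// one waiter per thread, reused by every primitive built on the waitset :
thread_local Waiter t_waiter;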

If you didn't try to do the lock-free wait-side, recycling would be easy. But with the lock-free wait side there are some issues.

The first is that when you do a prepare-then-cancel, your cancel might not actually be done for a long time (it was just a request). So if you come back around on the same thread and call prepare() again, prepare has to check whether that earlier cancel has been processed or not. If it has not, then you just have to force the Notify-side list maintenance to be done immediately.
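
In code, that check might look something like this. It's a sketch : the in_waitlist flag (maintained by the Notify side as it adds and removes waiters), FlushMessages, and QueuePrepareWaitMessage are assumptions standing in for the real bookkeeping, not the actual implementation :

#include <atomic>

struct Waiter
{
    std::atomic<bool> in_waitlist { false };   // set/cleared by the Notify side (assumption)
    // ... condition, per-thread OS event, etc.
};

void FlushMessages();                                       // run the Notify-side list maintenance now (assumption)
void QueuePrepareWaitMessage(Waiter * w, void * condition); // the lock-free message send (assumption)

void PrepareWait(Waiter * w, void * condition)
{
    if ( w->in_waitlist.load(std::memory_order_acquire) )
    {
        // the cancel from this waiter's previous use is still just a queued message,
        // so the waiter may still be in the waitlist; force the Notify-side
        // maintenance to run immediately (rare path - takes the waitlist lock)
        FlushMessages();
    }
    QueuePrepareWaitMessage(w, condition);
}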

The second related issue is that the lock-free wait-side can give you spurious signals to your waiter. Normally prepare_wait could clear the OS handle, so that when you wait on it you know that you got the signal you wanted. But because prepare_wait is just a message and doesn't take the lock on the waitlist, you might actually still be in the waitlist from the previous time you used your waiter. Thus you can get a signal that you didn't want. There are a few solutions to this; one is to allow spurious signals (I don't love that); another is to detect that the signal is spurious and wait again (I do this). Another is to always just grab the waitlist lock (and do nothing) in either cancel or prepare_wait.


OK, so we now have a clean waitset that can do NotifyOne and guarantee no spurious signals. Let's use it.

You may recall we've looked at a simple waitset-based mutex before :


U32 thinlock; // 0 = unlocked, 1 = locked, 2 = locked with possible waiters

Lock :
{
    // first check :
    while( Exchange(&thinlock,1) != 0 )
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel();
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 )
    {
        waitset.NotifyAll( &thinlock );
    }
}
This mutex is non-recursive, and of course you should spin doing some TryLocks before going into the wait loop for efficiency.

This was an okay way to build a mutex on waitset when all you had was NotifyAll. It only does the notify if there are waiters, but the big problem is that if you have multiple waiters, it wakes them all, they all run in and try to grab the mutex, and all but one fail and go back to sleep. This is a common type of unnecessary-wakeup thread-thrashing pattern that sucks really bad.

(any time you write threading code where the wakeup means "hey wakeup and see if you can grab an atomic" (as opposed to "wakeup you got it"), you should be concerned (particularly when the wake is a broadcast))

Now that we have NotifyOne we can fix that mutex :


U32 thinlock; // 0 = unlocked, 1 = locked uncontended (via TryLock), 2 = locked, possibly with waiters

Lock :
{
    // first check :
    while( Exchange(&thinlock,2) != 0 ) // (*1)
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel(waitset_resignal_no); // (*2)
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 ) // (*3)
    {
        waitset.NotifyOne( &thinlock );
    }
}
We changed the NotifyAll to NotifyOne, but a few funny bits are worth noting :

(*1) - we must now immediately exchange in the waiter-flag here; in the NotifyAll case it worked to put a 1 in there for funny reasons (see cbloom rants 07-15-11 - Review of many Mutex implementations, where this type of mutex is discussed as "Three-state mutex using Event"), but it doesn't work with the NotifyOne.

(*2) - with a mutex you do not need to pass on the signal when you stole it and cancelled. The reason is just that there can't possibly be any more mutex for another thread to consume. A mutex is a lot like a semaphore with a maximum count of 1 (actually it's exactly like that for non-recursive mutexes); you only need to pass on the signal when it's possible that some other thread needs to know about it.

(*3) - you might think the check for == 2 here is dumb because we always put in a 2, but there's code you're not seeing. TryLock should still put in a 1, so in the uncontended case the thinlock will have a value of 1 and no Notify is done. The thinlock only goes to a 2 if there is some contention, and then the value stays at 2 until the last unlock of that contended sequence.
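
As an aside, the TryLock that (*3) alludes to is not shown; a sketch of it, with the post's U32/Exchange primitives assumed to map onto a std::atomic, would be :

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> thinlock { 0 };

bool TryLock()
{
    uint32_t expected = 0;
    // put in a 1, not a 2 : an uncontended TryLock/Unlock pair never sees a 2,
    // so the (== 2) check in Unlock skips the Notify entirely
    return thinlock.compare_exchange_strong(expected, 1,
                std::memory_order_acquire, std::memory_order_relaxed);
}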

Okay, so that works, but it's kind of silly. With the mechanism we have now we can do a much neater mutex :


U32 thinlock; // = 0 initializes thinlock

Lock :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinlock );

    if ( Fetch_Add(&thinlock,1) == 0 )
    {
        // got the lock (no need to resignal)
        w.Cancel(waitset_resignal_no);
        return;
    }

    w.Wait();
    // woke up - I have the lock !
}

Unlock :
{
    if ( Fetch_Add(&thinlock,-1) > 1 )
    {
        // there were waiters
        waitset.NotifyOne( &thinlock );
    }
}
The mutex is just a wait-count now. (as usual you should TryLock a few times before jumping into the PrepareWait). This mutex is more elegant; it also has a small performance advantage in that it only calls NotifyOne when it really needs to; because its gate is also a wait-count, it knows whether it needs to Notify or not. The mutex posted previously will always Notify on the last unlock whether or not it needs to (ie. it will always do one Notify too many).
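
The "TryLock a few times first" wrapper is the usual spin-then-wait shape; a quick sketch, where kSpinCount is a tuning assumption, TryLock is a CAS of the thinlock from 0 to 1, and LockSlow stands for the PrepareWait / Fetch_Add / Wait path above :

#include <thread>

const int kSpinCount = 100;   // tuning assumption

bool TryLock();               // CAS the thinlock 0 -> 1
void LockSlow();              // the PrepareWait / Fetch_Add / Wait path above

void Lock()
{
    for ( int spin = 0 ; spin < kSpinCount ; spin++ )
    {
        if ( TryLock() )
            return;                    // got the lock without touching the waitset
        std::this_thread::yield();     // or a pause/spin hint
    }
    LockSlow();                        // give up spinning and go to sleep properly
}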

This last mutex is also really just a semaphore. We can see it by writing a semaphore with our waitset :


U32 thinsem; // = 0 initializes thinsem

Wait :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinsem );

    if ( Fetch_Add(&thinsem,-1) > 0 )
    {
        // got a dec on count
        w.Cancel(waitset_resignal_yes); // (*1)
        return;
    }

    w.Wait();
    // woke up - I got the sem !
}

Post :
{
    if ( Fetch_Add(&thinsem,1) < 0 )
    {
        waitset.NotifyOne( &thinsem );
    }
}

which is obviously the same. The only subtle change is at (*1) - with a semaphore we must do the resignal, because there may have been several posts to the sem (contrast with mutex where there can only be one Unlock at a time; and the mutex itself serializes the unlocks).


Oh, one very subtle issue that I only discovered due to Relacy :

waitset.Notify requires a #StoreLoad between the condition check and the notify call. That is, the standard pattern for any kind of "Publish" is something like :


Publish
{
    change shared variable
    if ( any waiters )
    {
        #StoreLoad
        waitset.Notify()
    }
}

Now, in most cases, such as the Sem and Mutex posted above, the Publish uses an atomic RMW op. If that is the case, then you don't need to add any more barriers - the RMW synchronizes for you. But if you do some kind of more weakly ordered primitive, then you must force a barrier there.
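
For illustration, here's one way that looks in C++11 atomics when the publish is a plain store rather than an RMW (waiter_count standing in for the "any waiters" check is an assumption about the waitset's bookkeeping) :

#include <atomic>

std::atomic<int> shared_value { 0 };
std::atomic<int> waiter_count { 0 };   // maintained by the waitset (assumption)

void NotifyOne(void * condition);      // the waitset notify

void Publish(int v)
{
    // weakly ordered publish - a plain store, not an RMW, so it does not give
    // us the #StoreLoad by itself
    shared_value.store(v, std::memory_order_release);

    // the #StoreLoad : without it the store above can be ordered after the load
    // of waiter_count, and a waiter that registered itself concurrently can be
    // missed.  With an atomic RMW publish (exchange / fetch_add) this is free.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    if ( waiter_count.load(std::memory_order_relaxed) > 0 )
    {
        NotifyOne(&shared_value);
    }
}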

This is the exact same issue that I've run into before and forgotten about again :

cbloom rants 07-31-11 - An example that needs seq_cst -
cbloom rants 08-09-11 - Threading Links (see threads on eventcount)
cbloom rants 06-01-12 - On C++ Atomic Fences Part 3
