I've been redoing my low-level waitset and want to make some notes.
Some previous discussion of the same issues here :
cbloom rants 11-28-11 - Some lock-free rambling
cbloom rants 11-30-11 - Some more Waitset notes
cbloom rants 12-08-11 - Some Semaphores
In particular, two big things occurred to me :
1. I talked before about the "passing on the signal" issue. See the above posts for more in-depth details,
but in brief : the issue arises if you are trying to do NotifyOne (instead of NotifyAll) and you have a double-check
waitset like this :
{
    waiter = waitset.prepare_wait(condition);

    if ( double check )
    {
        waiter.cancel();
    }
    else
    {
        waiter.wait();
        // possibly loop and re-check condition
    }
}
then if you get a signal between prepare_wait and cancel, you didn't need that signal, so a wakeup of
another thread that did need that signal can be missed.
Now, I talked about this before as an "ugly hack", but having thought about it over time, it doesn't seem so bad.
In particular, if you put the resignal inside cancel() , so that the client code looks just like the above,
the client doesn't need to know that the resignal mechanism exists at all.
So, the new concept is that cancel atomically removes the waiter from the waitset and sees if it got a signal
that it didn't consume. If so, it just passes on that signal. The reason this is okay, and not a hack,
became clear to me when I thought about the conditions under which it actually happens. If you recall from the earlier
posts, the need for resignal comes from situations like :
T0 posts sem , and signals no one
T1 posts sem , and signals T3
T2 tries to dec count and sees none, goes into wait()
T3 tries to dec count and gets one, goes into cancel(), but also got the signal - must resignal T2
The thing is, this can only happen if all the threads are awake and racing against each other (it requires
a very specific interleaving); that is,
the T3 that decs the count and does the resignal had to be awake anyway (its first check
saw no count, but its double check did dec the count, so it must have raced with the sem post). It's not like you
wake up a thread you shouldn't have and then pass the signal on. The thread wakeup scheme is just changed
from :
T0 sem.post --wakes--> T2 sem.wait
T1 sem.post --wakes--> T3 sem.wait
to :
T0 sem.post
T1 sem.post --wakes--> T3 sem.wait --wakes--> T2 sem.wait
that is, one of the consumer threads wakes its peer. This is a tiny performance loss, but it's a pretty
rare race, so really not a bad thing.
The whole "double check" pathway in waitset only happens in a race case. It occurs when one thread sets the
condition you want right at the same time that you check it, so your first check fails and after you prepare_wait,
your second check passes. The resignal only occurs if you are in that race path, and also the setting thread
sent you a signal between your prepare_wait and cancel, *and* there's another thread waiting on that same signal
that should have gotten it. Basically this case is quite rare; we don't care too much about it being fast or
elegant (as long as it's not disastrously slow), we just need the behavior to be correct when it does happen - and
the "pass on the signal" mechanism gives you that.
The advantage of being able to do just a NotifyOne instead of a NotifyAll is so huge that it's worth adopting
this as standard practice in waitset.
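For concreteness, here is a minimal sketch of what a cancel() with the resignal folded in might look like.
This is my own illustration, not the actual implementation; it assumes the waitset keeps an intrusive wait
list under a plain lock and that Notify marks the waiter it picked with a flag (all the names here are mine) :

#include <mutex>

struct Waiter
{
    Waiter *  m_next      = nullptr;  // intrusive wait-list link
    void *    m_condition = nullptr;  // the address being waited on
    bool      m_signalled = false;    // set by NotifyOne/NotifyAll under the lock
    // ... plus the per-thread OS waitable handle to actually sleep on
};

struct Waitset
{
    std::mutex  m_lock;
    Waiter *    m_head = nullptr;

    void unlink(Waiter * w) { /* remove w from the list ; not shown */ }
    void notify_one(void * condition) { /* wake one waiter on condition ; not shown */ }

    void cancel(Waiter * w, bool resignal)
    {
        bool got_signal;
        {
            std::lock_guard<std::mutex> lock(m_lock);
            unlink(w);                    // atomically (under the lock) leave the waitset
            got_signal = w->m_signalled;  // did a Notify pick us between prepare_wait and now ?
            w->m_signalled = false;
        }
        if ( got_signal && resignal )
        {
            // we consumed a NotifyOne we didn't need ; pass it on to a peer
            notify_one(w->m_condition);
        }
    }
};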
2. It then occurred to me that the waitset PrepareWait and Cancel could be made lock-free pretty trivially.
Conceptually, they are made lock-free by turning them into messages : "Notify" is now the receiver of the messages,
and the scheme becomes :
{
    waiter w;
    waitset : send message { prepare_wait , &w, condition };

    if ( double check )
    {
        waitset : send message { cancel , &w };
        return;
    }

    w.wait();
}

-------

waitset Notify(condition) :
{
    first consume all messages
        do the prepare_wait and cancel actions
    then do the normal notify
        eg. see if there are any waiters that want to know about "condition"
}
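A minimal sketch of how the wait-side "send message" can be lock-free : push the requests onto an intrusive
atomic stack, and have the Notify side drain the whole stack before doing its list maintenance. The types and
names here are my own illustration of the idea, not the post's code :

#include <atomic>

struct WaitsetMsg
{
    enum Kind { PrepareWait, Cancel };
    Kind          kind;
    void *        waiter;      // which per-thread waiter
    void *        condition;   // condition address (PrepareWait only)
    WaitsetMsg *  next;
};

struct MsgChannel
{
    std::atomic<WaitsetMsg *> head { nullptr };

    // wait side : lock-free push ; this is the entire cost of PrepareWait/Cancel
    void send(WaitsetMsg * m)
    {
        WaitsetMsg * h = head.load(std::memory_order_relaxed);
        do { m->next = h; }
        while ( ! head.compare_exchange_weak( h, m,
                    std::memory_order_release, std::memory_order_relaxed ) );
    }

    // notify side : grab everything at once (called with the waitlist lock held)
    WaitsetMsg * consume_all()
    {
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};

Note that the drained list comes out newest-first, so the consumer has to restore order (or match each cancel
to its prepare_wait by waiter) before applying the actions.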
The result is that the entire wait-side operation is lock-free. The notify side still uses a lock to
ensure the consistency of the wait list.
This greatly reduces contention in the most common usage patterns :
Mutex :

    only the mutex owner does Notify
    - so contention on the waitset lock is non-existent

    many threads may try to lock a mutex
    - they do not have any waitset-lock contention

Semaphore :

    the common case of one producer and many consumers (lots of threads do wait() )
    - zero contention on the waitset lock

    the less common case of many producers and few consumers is slow
Another way to look at it : instead of doing little bits of waitlist maintenance in three
different places (prepare_wait, notify, cancel), each of which has to take a lock on the list,
all the maintenance is moved to one spot.
Now there are some subtleties.
If you used a fresh "waiter" every time, things would be simple. But for efficiency you don't
want to do that. In fact I use one unique waiter per thread. There's only one OS waitable handle
needed per thread and you can use that to implement every threading primitive. But now you have
to be able to recycle the waiter. Note that you don't have to worry about other threads using your
waiter; the waiter is per-thread so you just have to worry about when you come around and use it
again yourself.
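As a sketch of the "one OS waitable handle per thread" idea : on Windows that handle could be an auto-reset
Event kept in TLS. This is just an illustration of the shape of it; the names are mine :

#include <windows.h>

struct ThreadWaiter
{
    HANDLE  m_event;    // auto-reset : one Wait consumes one Signal

    ThreadWaiter()  : m_event( CreateEventW(NULL, FALSE, FALSE, NULL) ) { }
    ~ThreadWaiter() { CloseHandle(m_event); }

    void Signal()   { SetEvent(m_event); }
    void Wait()     { WaitForSingleObject(m_event, INFINITE); }
};

// one waiter per thread, reused by every primitive built on the waitset :
ThreadWaiter & GetThreadWaiter()
{
    static thread_local ThreadWaiter s_waiter;
    return s_waiter;
}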
If you didn't try to do the lock-free wait-side, recycling would be easy. But with the lock-free
wait side there are some issues.
The first is that when you do a prepare-then-cancel, your cancel might not actually be done for a long time
(it was just a request). So if you come back around on the same thread and call prepare() again,
prepare has to check whether that earlier cancel has been processed yet. If it has not, then you just have
to force the Notify-side list maintenance to be done immediately.
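A sketch of that check, reusing the hypothetical names from the message-channel sketch above and assuming each
waiter tracks a small state (Idle / Waiting); flush_messages() and send_message() are illustrative names for
"take the waitlist lock and drain the channel" and "push a message", not real APIs from the post :

void Waitset::prepare_wait(Waiter * w, void * condition)
{
    if ( w->m_state != Waiter::Idle )
    {
        // the cancel we issued last time is still just a queued message ;
        // force the notify-side maintenance now so the recycled waiter is
        // clean before we reuse it
        flush_messages();
    }

    w->m_state     = Waiter::Waiting;
    w->m_condition = condition;
    send_message( WaitsetMsg::PrepareWait, w, condition );
}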
The second related issue is that the lock-free wait-side can give you spurious signals to your waiter.
Normally prepare_wait could clear the OS handle, so that when you wait on it you know that you got the
signal you wanted. But because prepare_wait is just a message and doesn't take the lock on the waitlist,
you might actually still be in the waitlist from the previous time you used your waiter. Thus you can
get a signal that you didn't want. There are a few solutions to this; one is to allow spurious signals
(I don't love that); another is to detect that the signal is spurious and wait again (I do this). Another
is to always just grab the waitlist lock (and do nothing) in either cancel or prepare_wait.
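The "detect that the signal is spurious and wait again" option can be as simple as having Notify stamp the
waiter with what it was woken for, something like the sketch below - again, my own illustrative names, and
only one of several possible schemes :

void Waiter::WaitFor(void * condition)
{
    for(;;)
    {
        Wait();     // block on the per-thread OS handle

        if ( m_woken_for == condition )
            return; // the signal we actually wanted

        // else : a stale signal left over from this recycled waiter's
        // previous use ; swallow it and wait again
    }
}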
Ok, so we now have a clean waitset that can do NotifyOne and guarantee no spurious signals. Let's use it.
You may recall we've looked at a simple waitset-based mutex before :
U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,1) != 0 )
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel();
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 )
    {
        waitset.NotifyAll( &thinlock );
    }
}
This mutex is non-recursive, and of course you should spin doing some TryLocks before going into the wait loop
for efficiency.
This was an okay way to build a mutex on the waitset when all you had was NotifyAll. It only does the notify
if there are waiters, but the big problem with it is that if you have multiple waiters, it wakes them all,
they all run in to try to grab the mutex, and all but one fail and go back to sleep. This is a common
type of unnecessary-wakeup thread-thrashing pattern that sucks really bad.
(any time you write threading code where the wakeup means "hey, wake up and see if you can grab an atomic"
(as opposed to "wake up, you got it"), you should be concerned (particularly when the wake is a broadcast))
Now that we have NotifyOne we can fix that mutex :
U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,2) != 0 ) // (*1)
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel(waitset_resignal_no); // (*2)
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 ) // (*3)
    {
        waitset.NotifyOne( &thinlock );
    }
}
We changed the NotifyAll to NotifyOne, but a few bits are worth noting :

(*1) : we must now immediately exchange in the waiter-flag here; in the NotifyAll case it worked to put a 1 in
there for funny reasons (see
cbloom rants 07-15-11 - Review of many Mutex implementations ,
where this type of mutex is discussed as "Three-state mutex using Event"), but it doesn't work with NotifyOne.

(*2) : with a mutex you do not need to pass on the signal when you stole it and cancelled. The reason is simply that
there can't possibly be any more mutex for another thread to consume. A mutex is a lot like a semaphore with a maximum
count of 1 (for non-recursive mutexes it's exactly that);
you only need to pass on the signal when it's possible that some other thread needs to know about it.

(*3) : you might think the check for == 2 here is dumb because we always put in a 2, but there's code you're not seeing.
TryLock should still put in a 1, so in the uncontended case the thinlock will have a value of 1 and no Notify is done.
The thinlock only goes to 2 if there is some contention, and then the value stays at 2 until the last unlock of that
contended sequence.
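To make (*3) concrete, here's a minimal sketch of what that TryLock could look like, written with std::atomic;
the exact memory orders and the atomic declaration are my assumptions, not code from the post :

#include <atomic>

std::atomic<unsigned> thinlock { 0 };  // 0 = free, 1 = held uncontended, 2 = held with possible waiters

bool TryLock()
{
    unsigned expected = 0;
    // take the lock only if it's free, and mark it with a 1 so
    // Unlock's (== 2) check skips the Notify in the uncontended case
    return thinlock.compare_exchange_strong( expected, 1,
                std::memory_order_acquire, std::memory_order_relaxed );
}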
Okay, so that works, but it's kind of silly. With the mechanism we have now we can do a much neater mutex :
U32 thinlock; // = 0 initializes thinlock

Lock :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinlock );

    if ( Fetch_Add(&thinlock,1) == 0 )
    {
        // got the lock (no need to resignal)
        w.Cancel(waitset_resignal_no);
        return;
    }

    w.Wait();
    // woke up - I have the lock !
}

Unlock :
{
    if ( Fetch_Add(&thinlock,-1) > 1 )
    {
        // there were waiters
        waitset.NotifyOne( &thinlock );
    }
}
The mutex is just a wait-count now. (As usual you should TryLock a few times before jumping into the PrepareWait;
see the sketch below.) This mutex is more elegant; it also has a small performance advantage in that it only calls
NotifyOne when it really needs to : because its gate is also a wait-count, it knows whether it needs to Notify or not.
The previous mutex will always Notify on the last unlock of a contended sequence whether or not anyone is still
waiting (eg. it will always do one Notify too many).
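For the "TryLock a few times before jumping into the PrepareWait" part, a minimal sketch might look like this.
Since the value is a count of holders-plus-waiters, TryLock is just a 0 -> 1 CAS; the spin count, the atomic
declaration, and Lock() standing for the slow path above are all my assumptions :

#include <atomic>

extern std::atomic<int> thinlock;   // the wait-count from the mutex above
void Lock();                        // the waitset-based slow path above

void LockWithSpin()
{
    for ( int spins = 0; spins < 32; spins++ )
    {
        int expected = 0;
        if ( thinlock.compare_exchange_weak( expected, 1,
                std::memory_order_acquire, std::memory_order_relaxed ) )
            return;     // got it without touching the waitset at all
    }

    Lock();             // contended : go through PrepareWait / Wait
}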
This last mutex is also really just a semaphore. We can see it by writing a semaphore with our waitset :
U32 thinsem; // = 0 initializes thinsem

Wait :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinsem );

    if ( Fetch_Add(&thinsem,-1) > 0 )
    {
        // got a dec on count
        w.Cancel(waitset_resignal_yes); // (*1)
        return;
    }

    w.Wait();
    // woke up - I got the sem !
}

Post :
{
    if ( Fetch_Add(&thinsem,1) < 0 )
    {
        waitset.NotifyOne( &thinsem );
    }
}
which is obviously the same. The only subtle change is at (*1) - with a semaphore we must do the resignal,
because there may have been several posts to the sem (contrast with mutex where there can only be one Unlock at a time;
and the mutex itself serializes the unlocks).
Oh, one very subtle issue that I only discovered due to Relacy :
waitset.Notify requires a #StoreLoad between the condition check and the notify call. That is, the standard
pattern for any kind of "Publish" is something like :
Publish
{
    change shared variable

    if ( any waiters )
    {
        #StoreLoad
        waitset.Notify()
    }
}
Now, in most cases, such as the Sem and Mutex posted above, the Publish uses an atomic RMW op. If that
is the case, then you don't need to add any more barriers - the RMW synchronizes for you. But if you do
some kind of more weakly ordered primitive, then you must force a barrier there.
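In C++11 atomics terms, the weakly-ordered case might look something like the sketch below; the Waitset stub
and the names here are just stand-ins for the waitset above, not real code from the post :

#include <atomic>

struct Waitset { void NotifyOne(void * condition) { /* wake one waiter ; not shown */ } };

std::atomic<int> shared_value { 0 };

void Publish( Waitset & waitset, int v )
{
    shared_value.store( v, std::memory_order_release );      // weakly ordered publish

    std::atomic_thread_fence( std::memory_order_seq_cst );   // the #StoreLoad

    waitset.NotifyOne( &shared_value );                       // loads the wait list internally
}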
This is the exact same issue that I've run into before and forgot about again :
cbloom rants 07-31-11 - An example that needs seq_cst -
cbloom rants 08-09-11 - Threading Links (see threads on eventcount)
cbloom rants 06-01-12 - On C++ Atomic Fences Part 3