I've had some trouble with this for a while; I think I finally have a scheme that really works.
There are two queues :
RTR = ready to run : no dependencies, or dependencies are done
NR = not ready ; dependencies still pending
Each queue has an associated semaphore to count the number of items in it.
The basic work popping that each worker does is something like :
// get all available work without considering sleeping -
while( try_wait( RTR_sem ) )
{
    pop RTR_queue and do work
}

// (optionally spin a few times here and check RTR_sem)

// I may have to sleep -
wait( RTR_sem OR NR_sem ); // (*1)
if ( wakeup was from RTR_sem )
{
    pop RTR_queue and do work
}
else
{
    NRI (not ready item) = pop NR_queue
    deps = get dependencies that NRI needs to wait on
    wait( deps OR RTR_sem ); // (*3)
    if ( wakeup was from RTR_sem )
    {
        push NRI back on NR_queue and post NR_sem // (*4)
        pop RTR_queue and do work
    }
    else
    {
        // wakeup was because deps are now done
        // NRI should be able to run now, so do it
        // (*2)
    }
}
*1 : the key primitive here is the ability to do a WFMO OR wait, and to know which one of the items signalled you.
On Windows this is very easy, it's just WaitForMultipleObjects, which returns the guy who woke you. On other platforms
it's trickier and probably involves rolling some of your own mechanisms.
Note that I'm assuming the semaphore Wait() will dec the semaphore at the time you get to run, and the OR wait on multiple semaphores will only dec one of them.
*2 : in practice you may get spurious wakeups or it may be hard to wait on all the dependencies, so you would loop and recheck the deps and possibly wait on them again.
How this differs from my previous system :
My previous system was more of a traditional "work stealing" scheme where each worker had its own queue and would try to just push & pop work from its own queue. This was lower overhead in the fast path (it avoids having a single shared semaphore that all the workers contend on, for example), but it had a lot of problems.
Getting workers to go to sleep & wake up correctly in a work stealing scheme is a real mess. It's very hard to tell when you have no work to do, or when you have enough work that you need to wake a new worker, because you don't atomically maintain a work count (eg. a semaphore). You could fix this by making an atomic pair { work items, workers awake } and CAS that pair to maintain it, but that's just a messy way of being a semaphore.
The other problem was what happens when you have dependent work. You want a worker to go to sleep on the dependency, so that it yields CPU time, but wakes up when it can run. I had that, but then you have the problem that if somebody else pushes work that can run immediately, you want to interrupt the wait on the dependency and let the worker do the ready work. The semaphore OR wait fixes this nicely.
If you're writing a fractal renderer or some such nonsense then maybe you want to make lots of little work items
and have minimal overhead. But that's a very special purpose rare case. Most of the time it's much more
important that you do the right work when possible. My guiding principles are :
If there is no work that can be done now, workers should go to sleep (yield CPU)
If there is work that can be done now, workers should wake up
You should not wake up a worker and have it go back to sleep immediately
You should not have work available to do but the workers sleeping
Even in the "fractal renderer" sort of case, where you have tons of non-dependent work items, the only penalty here is one extra semaphore dec per item, and that's just a CAS (or a fetch_add) assuming you use something like "fastsemaphore" to fast-path the case of being not near zero count.
There is one remaining issue, which is when there is no ready-to-run work, and the workers are asleep on the first semaphore (they have no work items). Then you push a work item with dependencies. What will happen in the code sketch above is that the worker will wake up, pop the not ready item, then go back to sleep on the dependency. This violates article 3 of the resolution ("You should not wake up a worker and have it go back to sleep immediately").
Basically from *1 to *3 in the code is a very short path that wakes from one wait and goes into another wait; that's always bad.
But this can be fixed. What you need is "wait morphing". When you push a not-ready work item and you go into the semaphore code that is incrementing the NR_sem , and you see that you will be waking a thread - before you wake it up, you take it out of the NR_sem wait list, and put it into the NRI's dependency wait list. (you leave it waiting on RTR_sem).
That is, you just leave the thread asleep, you don't signal it to wake it up, it stays waiting on the same handle, but you move the handle from NR_sem to the dependency. You can implement this a few ways. I believe it could be done with Linux'es newer versions of futex which provide wait morphing. You would have to build your semaphore and your dependency waiting on futex, which is easy to do, then wait morph to transfer the wait. Alternatively if you build them on "waitset" you simply need to move an item from one waitset to the other. This can be done easily if your waitset uses a mutex to protect its internals, you simply lock both mutexes and move the waitable handle with both held.
The net result with wait morphing is very nice. Say for example all your workers are asleep. You create a work item that is dependent on an IO and push it. None of the workers get woken up, but one of them is changed from waiting on work available to waiting on the dependency. When the IO completes it wakes that worker and he runs. If somebody pushed a ton of work in the mean time, all the workers would be woken and they would do that work, and the dependent work would be pushed back on the NR queue and set aside while they did RTR work.
ADDENDUM : at the spot marked (*4) :
push NRI back on NR_queue and post NR_sem // (*4)
pop RTR_queue and do work
In real code you need to do something a bit more complex here. What you do is something like :
if ( NRI is ready ) // double check
{
    RTR_sem.post() // we woke from RTR_sem , put it back
    do NRI work
}
else
{
    push NRI onto NR_lifo and post NR_sem
    pop RTR_queue and do work
}
We've introduced a new queue, the NR_lifo, which is a LIFO (eg. stack). Now whenever you get an NR_sem post, you do :
// NR_sem just returned from wait so I know an NR item is available :
NRI = NR_lifo.pop()
if ( NRI == NULL )
    NRI = NR_queue.pop()
The item must be in one or the other, and we prefer to take from the LIFO first. Basically the LIFO is a holding area for items that were popped off the FIFO and were not yet ready, so we want to keep trying to run those before we go back to the FIFO. You can use a single semaphore to indicate that there is an item in either queue.