11-30-11 - Basic sketch of Worker Thread system with dependencies

You have a bunch of worker threads and work items. Work items can be dependent, on other work items, or on external timed events (such as IO).

I've had some trouble with this for a while; I think I finally have a scheme that really works.

There are two queues :

RTR = ready to run : no dependencies, or dependencies are done

NR = not ready ; dependencies still pending

Each queue has an associated semaphore to count the number of items in it.

The basic work popping that each worker does is something like :

// get all available work without considering sleeping -
while( try_wait( RTR_sem ) )
    pop RTR_queue and do work

// (optionally spin a few times here and check RTR_sem)

// I may have to sleep -

wait( RTR_sem OR NR_sem ); // (*1)

if ( wakeup was from RTR_sem )
    pop RTR_queue and do work
    NRI (not ready item) = pop NR_queue
    deps = get dependencies that NRI needs to wait on

    wait( deps OR RTR_sem ); // (*3)

    if ( wakeup was from RTR_sem )
        push NRI back on NR_queue and post NR_sem  // (*4)
        pop RTR_queue and do work
        wakeup was because deps are now done
        NRI should be able to run now, so do it

*1 : the key primitive here is the ability to do a WFMO OR wait, and to know which one of the items signalled you. On Windows this is very easy, it's just WaitForMultipleObjects, which returns the guy who woke you. On other platforms it's trickier and probably involves rolling some of your own mechanisms.

Note that I'm assuming the semaphore Wait() will dec the semaphore at the time you get to run, and the OR wait on multiple semaphores will only dec one of them.

*2 : in practice you may get spurious wakeups or it may be hard to wait on all the dependencies, so you would loop and recheck the deps and possibly wait on them again.

How this differs from my previous system :

My previous system was more of a traditional "work stealing" scheme where each worker had its own queue and would try to just push & pop works from its own queue. This was lower overhead in the fast path (it avoids having a single shared semaphore that they have to contend on, for example), but it had a lot of problems.

Getting workers to go to sleep & wake up correctly in a work stealing scheme is a real mess. It's very hard to tell when you have no work to do, or when you have enough work that you need to wake a new worker, because you don't atomically maintain a work count (eg. a semaphore). You could fix this by making an atomic pair { work items, workers awake } and CAS that pair to maintain it, but that's just a messy way of being a semaphore.

The other problem was what happens when you have dependent work. You want a worker to go to sleep on the dependency, so that it yeilds CPU time, but wakes up when it can run. I had that, but then you have the problem that if somebody else pushes work that can immediately run, you want to interrupt that wait on the dependency and let the worker do the ready work. The semaphore OR wait fixes this nicely.

If you're writing a fractal renderer or some such nonsense then maybe you want to make lots of little work items and have minimal overhead. But that's a very special purpose rare case. Most of the time it's much more important that you do the right work when possible. My guiding principles are :

If there is no work that can be done now, workers should go to sleep (yield CPU)
If there is work that can be done now, workers should wake up
You should not wake up a worker and have it go back to sleep immediately
You should not have work available to do but the workers sleeping

Even in the "fractal renderer" sort of case, where you have tons of non-dependent work items, the only penalty here is one extra semaphore dec per item, and that's just a CAS (or a fetch_add) assuming you use something like "fastsemaphore" to fast-path the case of being not near zero count.

There is one remaining issue, which is when there is no ready-to-run work, and the workers are asleep on the first semaphore (they have no work items). Then you push a work item with dependencies. What will happen in the code sketch above is that the worker will wake up, pop the not ready item, then go back to sleep on the dependency. This violates article 3 of the resolution ("You should not wake up a worker and have it go back to sleep immediately").

Basically from *1 to *3 in the code is a very short path that wakes from one wait and goes into another wait; that's always bad.

But this can be fixed. What you need is "wait morphing". When you push a not-ready work item and you go into the semaphore code that is incrementing the NR_sem , and you see that you will be waking a thread - before you wake it up, you take it out of the NR_sem wait list, and put it into the NRI's dependency wait list. (you leave it waiting on RTR_sem).

That is, you just leave the thread asleep, you don't signal it to wake it up, it stays waiting on the same handle, but you move the handle from NR_sem to the dependency. You can implement this a few ways. I believe it could be done with Linux'es newer versions of futex which provide wait morphing. You would have to build your semaphore and your dependency waiting on futex, which is easy to do, then wait morph to transfer the wait. Alternatively if you build them on "waitset" you simply need to move an item from one waitset to the other. This can be done easily if your waitset uses a mutex to protect its internals, you simply lock both mutexes and move the waitable handle with both held.

The net result with wait morphing is very nice. Say for example are you workers are asleep. You create a work item that is dependent on an IO and push it. None of the workers get woken up, but one of them is changed from waiting on work available to waiting on the dependency. When the IO completes it wakes that worker and he runs. If somebody pushed a ton of work in the mean time, all the workers would be woken and they would do that work, and the dependent work would be pushed back on the NR queue and set aside while they did RTR work.

ADDENDUM : at the spot marked (*4) :

push NRI back on NR_queue and post NR_sem // (*4)
pop RTR_queue and do work

In real code you need do something a bit more complex here. What you do is something like :

if ( NRI is ready ) // double check
  RTR_sem.post() // we woke from RTR_sem , put it back
  do NRI work
  push NRI onto NR_lifo and post NR_sem
  pop RTR_queue and do work

we've introduced a new queue , the NR_lifo which is a LIFO (eg. stack). Now whenever you get an NR_sem post, you do :

// NR_sem just returned from wait so I know an NR item is available :

NRI = NR_lifo.pop()
if ( NRI == NULL )
  NRI = NR_queue.pop()

the item must be in one or the other and we prefer to take from the LIFO first. Basically the LIFO is a holding area for items that were popped off the FIFO and were not yet ready, so we want to keep trying to run those before we go back to the FIFO. You can use a single semaphore to indicate that there is an item in either queue.


Miroslav said...

Text does not show how nr-task can end in rtr-queue so it might be obvious. The question is: Can one particular nr-task get stuck if it will regularly go through pop from nr-queue front and immediately push to its back?

cbloom said...

"Text does not show how nr-task can end in rtr-queue so it might be obvious."

It doesn't, it stays in NR queue until its deps are done, at which point it will get executed at spot (*2)

"The question is: Can one particular nr-task get stuck if it will regularly go through pop from nr-queue front and immediately push to its back?"

Sort of; it shouldn't get stuck, eventually its deps will be done and it will be run and not re-pushed. You are correct however that as written I was popping off the front of a FIFO and putting on the back, which is not ideal.

I'm adding an addendum to the post about that with code mark (*4)

cbloom said...

Another point about that :

"The question is: Can one particular nr-task get stuck if it will regularly go through pop from nr-queue front and immediately push to its back?"

If you care about being "realtime" ; that is, if you want the FIFO to really be "fair" , there is an issue :

say you have some NR item and you push it.

then somebody just floods the system with submissions to the RTR queue.

Depending on how you implement your sem OR , you could never execute the NR item until all the RTR's are done.

This might be bad if you care about latency from submission to completion.

In particular if your semaphore OR wait does something like

if sem1.try_wait return
if sem2.try_wait return

then you prefer sem1 and will keep popping it until it is empty.

One simple way to fix this is to alternate or randomize the preference in the OR wait.

Drawknob said...

Re: tasks with dependencies. Have you loooked at phasers? www.cs.rice.edu/~vs3/PDF/Shirako-Sarkar-IPDPS-2010.pdf

old rants