cbloom rants: 03-06-12 - The Worker Wake and Semaphore Delay Issue

Let's say you have a pool of worker threads, and some of them may be asleep. How should you wake them up?

The straightforward way is to use a semaphore that counts the work items, and have the worker threads wait on the semaphore. Workers go to sleep when there is no work for them, and wake up when new work is pushed.
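As a rough sketch of that baseline (not any particular library's pool), here is a tiny Python version. The names `WorkQueue`, `STOP`, `worker` and `run_pool` are made up for illustration, and the "work" is just squaring a number :

```python
import threading
from collections import deque

class WorkQueue:
    """Queue whose semaphore count mirrors the number of queued items."""
    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self._sem = threading.Semaphore(0)

    def push(self, item):
        with self._lock:
            self._items.append(item)
        self._sem.release()       # inc sem : may wake a sleeping worker

    def pop(self):
        self._sem.acquire()       # dec sem : sleeps while there is no work
        with self._lock:
            return self._items.popleft()

STOP = object()                   # hypothetical "no more work" sentinel

def worker(queue, results):
    while True:
        item = queue.pop()
        if item is STOP:
            return
        results.append(item * item)   # stand-in for real work

def run_pool(values, num_workers=4):
    queue, results = WorkQueue(), []
    threads = [threading.Thread(target=worker, args=(queue, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for v in values:
        queue.push(v)             # one sem inc (and maybe one wake) per item
    for _ in threads:
        queue.push(STOP)          # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)
```

Every `push` here posts the semaphore immediately, which is exactly the behavior the rest of this post picks apart.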

But this is rather too aggressive about waking workers. If you push N items (N less than the number of worker threads) it can wake up to N sleeping workers. But by the time some of those workers actually wake, the items may already have been taken by other workers, so there is nothing for them to do.

Let's look at a few specific issues.

First of all, when you're making a bunch of work items, you might want to delay inc'ing the semaphore until you have made all the items, rather than inc'ing it each time. eg. instead of :


1 : incremental push :

push item A
inc sem
push item B 
inc sem
...

instead do


2 : batch push :

push item A
push item B
inc sem twice

There are a few differences. The only disadvantage of batch pushing is that the work doesn't start getting done until all the pushes are done. If you're creating a lot of jobs and there's a decent amount of processing to get them started, this adds latency.
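A minimal sketch of batch push, assuming a lock-protected deque and a plain counting semaphore (the class name `BatchQueue` and the `timeout` parameter on `pop` are just for illustration) :

```python
import threading
from collections import deque

class BatchQueue:
    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self._sem = threading.Semaphore(0)

    def push_batch(self, items):
        items = list(items)
        with self._lock:
            self._items.extend(items)   # push item A, push item B, ...
        for _ in items:
            self._sem.release()         # ... then inc sem once per item

    def pop(self, timeout=None):
        # Returns None if no work was posted within the timeout.
        if not self._sem.acquire(timeout=timeout):
            return None
        with self._lock:
            return self._items.popleft()
```

No worker can be woken until every item of the batch is already in the queue, so a worker that does wake never finds the batch half built.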

But what actually happens with incremental push? One possibility is like this :


bad incremental push :

push item A
inc sem

sem inc causes worker thread to wake up
pusher thread loses execution

worker pops item A
worker does item A
worker sees empty queue and goes to sleep

pusher thread wakes up

push item B 
inc sem
...

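That's very bad : every single item costs a thread wake, a context switch away from the pusher, and a worker going back to sleep. The possible slight loss of efficiency from batch push is worth it to avoid this kind of bad execution flow.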

There's a related issue when you are creating work items from a worker thread itself. Say a work item does some computation and also creates another work item :


Worker pops item A
does some stuff
push new work item B
inc sem
do some other stuff
item A done

Is this okay? Typically, no.

The problem is that if the other worker threads are asleep, that inc sem might wake one of them up to take item B. Then the original worker finishes item A, sees nothing else to do, and goes to sleep. It would have been much better if the worker just stayed awake and did item B itself.

We can fix this pretty easily. For work items pushed on worker threads, I typically use "batch push" (that is, delayed semaphore increment) with an extra wrinkle - I delay it up until my own thread tries to do a semaphore decrement.

That is, the way a worker thread runs is something like :


decrement semaphore (wait if count <= 0)
pop item
do work item (may create new work)

decrement semaphore (wait if count <= 0)
pop item
do work item ...

instead we do :


decrement semaphore (wait if count <= 0)
pop item
do work item (may create new work)

 push new work items but don't post the semaphore
 instead set N = number of incs to sem that are delayed

decrement semaphore AND add N (*)
pop item
do work item ...

The key operation is at (*) , where I post the sem for any work items I made, and also try to dec one.

The gain can be seen from a special case : if I made one work item, then the operation at (*) is a nop. I had one increment to the sem delayed, and I want to take one for myself, so I just take the one I had delayed. (If I made two, I post one and take one for myself, and so on.)
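Here's a small sketch of the (*) step, under some assumptions : work items are callables that receive the worker, the delayed count N lives in a per-worker field touched only by its own thread, and `DelayedPostWorker`, `dec_and_add_delayed` and `demo_chain` are hypothetical names. The `posts` counter exists only so the demo can show how many real sem incs happened :

```python
import threading
from collections import deque

class DelayedPostWorker:
    """One worker's view of a shared queue, with deferred sem posts."""
    def __init__(self, items, lock, sem):
        self.items, self.lock, self.sem = items, lock, sem
        self.delayed = 0   # N : sem incs owed for items this worker pushed
        self.posts = 0     # stats only : real sem incs issued

    def push(self, item):
        # Called from inside a work item : queue it, but don't post yet.
        with self.lock:
            self.items.append(item)
        self.delayed += 1

    def dec_and_add_delayed(self):
        # The (*) step : "add N and dec one" folded into one operation.
        if self.delayed > 0:
            for _ in range(self.delayed - 1):   # keep one count for myself
                self.sem.release()
                self.posts += 1
            self.delayed = 0
        else:
            self.sem.acquire()                  # nothing owed : normal wait

    def run_one(self):
        self.dec_and_add_delayed()
        with self.lock:
            item = self.items.popleft()
        item(self)                              # may call self.push(...)

def demo_chain(length):
    """Each item creates exactly one followup; count the sem posts."""
    items, lock, sem = deque(), threading.Lock(), threading.Semaphore(0)
    w = DelayedPostWorker(items, lock, sem)
    done = []

    def make_item(i):
        def item(worker):
            done.append(i)
            if i + 1 < length:
                worker.push(make_item(i + 1))
        return item

    items.append(make_item(0))   # first item comes from outside :
    sem.release()                # normal push + post
    for _ in range(length):
        w.run_one()
    return done, w.posts
```

In `demo_chain` the worker runs the whole chain without ever posting the semaphore, so no other thread gets woken just to find the queue empty.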

There is one extra little optimization you can do for an edge case. If you have some threads that are creating work items and some threads doing them, there is a sort of "performance race" between them. You really want them running alongside each other, with the creator feeding the poppers and neither running too fast. If the poppers run slightly faster than the creators, you can fall off a huge performance cliff when the poppers see no work available and go into an OS sleep. Now, obviously you use a spin in your semaphore, but an enhanced version is something like this :


delayed/batched work creation :

push various items
...
inc sem N times


work popper :

spin { try pop queue }
try dec sem
if didn't get pop , dec sem (may wait)

In words, the work popper can "shortcut" the delayed sem inc. That is, the pusher has created a delay between the queue push and the sem inc, but the delay only applies to thread wakeups!! (this is the important point). The delay does not apply to the work being available to already running worker threads.

That is, if the pusher is using delayed sem incs, and the popper is using sem shortcuts, then an active pusher makes work available to active workers as soon as possible. The thing that is delayed is only thread wakeup, so that we can avoid waking threads that don't need to wake up, or threads that will steal the execution context from the pusher, etc.
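One possible reading of the popper pseudocode, as a hedged sketch. `ShortcutQueue`, `push_delayed`, `post` and `spin_count` are invented names, and one assumption goes beyond the text above : because a shortcut pop can take an item before its sem inc arrives, the sem count can run ahead of the queue, so a woken worker re-checks the queue instead of assuming an item is there. (Spinning in Python is not a real optimization; the point is only the order of operations.)

```python
import threading
from collections import deque

class ShortcutQueue:
    def __init__(self, spin_count=100):
        self._items = deque()
        self._lock = threading.Lock()
        self._sem = threading.Semaphore(0)
        self._spin_count = spin_count    # hypothetical tuning knob

    def push_delayed(self, items):
        """Items are visible to running workers at once; wakeups come later."""
        items = list(items)
        with self._lock:
            self._items.extend(items)
        return len(items)                # number of sem incs now owed

    def post(self, n):
        for _ in range(n):
            self._sem.release()          # the delayed "inc sem N times"

    def try_pop(self):
        with self._lock:
            return self._items.popleft() if self._items else None

    def pop(self):
        have_count = False               # did a blocking dec already succeed?
        while True:
            for _ in range(self._spin_count):       # spin { try pop queue }
                item = self.try_pop()
                if item is not None:
                    if not have_count:
                        # try dec sem : eat the matching count if it has
                        # already been posted, never block here
                        self._sem.acquire(blocking=False)
                    return item
            self._sem.acquire()          # didn't get pop : dec sem (may wait)
            have_count = True            # the wake may be stale, so loop
```

Usage is `owed = q.push_delayed(batch)`, more pusher work, then `q.post(owed)`. A worker that is already awake can pop from the batch before `post` runs, while sleeping workers stay asleep until it does.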

Here's an example of how things can go wrong if you aren't careful about these things :

Each work item is creating a followup work item, but that wakes up the other worker thread, who quickly grabs the item, and I go back to sleep.

(you might ask why the work item is creating a followup instead of just doing more work; it's because the followup is dependent on an IO; in this particular case the IO is running faster than the computation, so the IO dependency for each item is already done, and they can be run immediately)

With delayed push & early pop it's all cleaned up :


3/06/2012
