04-06-09 - The Work Dispatcher

I thought I'd write a bit about my multithreaded "Worklet" dispatcher before I forget about it. I call little units of work "worklets" just because I like to be nonstandard and confusing.

The basic idea is that the main thread can at any time fire up a bunch of worklets. The worklets then go to a bunch of threads and get done. The main thread can then wait on the worklets.

There are a few things I do differently in the design from most people. The most common "thread pool" and "job swarm" things are very simple - each job is an isolated piece of independent work, and often once a bunch is fired you can only wait on all of them being done. I think these are too limiting to be really generally useful, so I added a few things.

1. Worker threads that are not in use should go completely to sleep and take 0 cpu time. There should be a worker thread per processor. There might also be other threads on these processors, and we should play nice with arbitrary other programs running and taking some of the cpu time. Once a worker thread wakes up to do work, it should stay awake as long as possible and do all the work it can, that is, it shouldn't have to go to sleep and wait for more work to get fired so it can wake up again.

2. Worklets can have dependencies on other worklets. That way you can set up a dependency tree and fire it, and it will run in the right order. eg. if I want to run A, then B, then run C only after A & B are both done, you fire worklets {A}, {B} and {C: dependent on AB}. Dependencies can be evaluated by the worker threads. That's crucial because it means they don't need to stall and wait for the main thread to fire new work to them.

3. The main thread can block or check status on any Worklet. The main thread (or other game threads) might keep running along, and the worker threads may be doing the work, and we want to be able to see if they are done or not. In particular we don't want to just support the OpenMP style "parallel for" where we fork a ton of work and then immediately block the main thread on it - we want real asynchronous function calls. Often in games we'll want to make the main thread(s) higher priority than the worker threads, so that the worker threads only run in idle time.
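As a rough sketch of the kind of API the three points above imply (all the names here are hypothetical, and a naive worker blocks on its dependencies rather than re-evaluating them as described below):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the Worklet API described above: fire work with
# dependencies, then block or poll on individual items from the main thread.

class Worklet:
    def __init__(self, fn, deps=()):
        self.fn = fn
        self.deps = list(deps)          # worklets that must finish first
        self.done = threading.Event()   # main thread can wait/poll on this

class Dispatcher:
    def __init__(self, num_workers=4):
        self.pool = ThreadPoolExecutor(max_workers=num_workers)

    def fire(self, worklet):
        def run():
            for d in worklet.deps:      # naive: the worker blocks on deps;
                d.done.wait()           # the real system re-checks instead
            worklet.fn()
            worklet.done.set()
        self.pool.submit(run)
        return worklet

# usage: run A and B, then C only after both are done
log = []
d = Dispatcher()
a = d.fire(Worklet(lambda: log.append('A')))
b = d.fire(Worklet(lambda: log.append('B')))
c = d.fire(Worklet(lambda: log.append('C'), deps=[a, b]))
c.done.wait()                           # main thread blocks on just this worklet
d.pool.shutdown(wait=True)
```

The per-Worklet event is what gives you point 3: the main thread can poll `done.is_set()` or block on `done.wait()` for any individual item, not just the whole batch.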

The actual worker threads I implemented with a work-stealing scheme. It's not true work-stealing at all, because there is a concept of a "main thread" that runs the game and pushes work, and there's also the dependencies that need to be evaluated. All of the thread-thread communication is lock free. When the main thread adds new work items it just jams them onto queues. When the worker threads pop off work items they just pop them off queues. I do currently use a lock for dependency evaluation.

Traditional work stealing (and the main papers) is designed for systems where the threads themselves are the ones making work. In that environment, the threads push work onto queues and then pop it off for themselves or steal from peers. There are special lock-free data structures designed for this access pattern - fast to push & pop at one end, while also supporting pops from the other end (stealing), though more slowly. What I'm doing is not traditional work stealing. I have external threads (the "game threads") that do not participate in doing the work, but they can push work to the workers. In my world the workers currently can never make new work (that could be added if it's useful).
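As a rough illustration of that access pattern (using a plain lock purely to keep the sketch short - the structures in the literature, e.g. the Chase-Lev deque, do this lock-free):

```python
import threading
from collections import deque

# Illustration of the work-stealing access pattern: the owner pushes and
# pops at one end (LIFO, fast path), thieves steal from the other end
# (FIFO, slow path). The lock here stands in for the real lock-free
# implementation.

class WorkDeque:
    def __init__(self):
        self.items = deque()
        self.lock = threading.Lock()

    def push(self, item):               # owner: fast end
        with self.lock:
            self.items.append(item)

    def pop(self):                      # owner: fast end (LIFO)
        with self.lock:
            return self.items.pop() if self.items else None

    def steal(self):                    # thief: other end (FIFO)
        with self.lock:
            return self.items.popleft() if self.items else None

q = WorkDeque()
for i in range(3):
    q.push(i)
owner_got = q.pop()     # the most recently pushed item: 2
thief_got = q.steal()   # the oldest item: 0
```

The LIFO owner end keeps recently pushed (cache-hot) work on the same core; thieves taking from the opposite end minimizes contention with the owner.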

There are a lot of nice things about work stealing. One is you don't need a separate dispatcher thread running all the time (which would hurt you with more context switches). Another is that workers who have cpu time can just keep jamming along by stealing work. They sort of do their own dispatching to themselves, so the threads that have the cpu do the work of the dispatching. It also offloads the dispatching work from the main thread. In my system, the workers do all work they can that's known to be dependency-okay. Once that work is exhausted they reevaluate dependencies to see if more work can be done, so they do the dependency checking work for themselves off the main thread.

Another nice thing about work stealing is that it's self-balancing in the face of external activity. Anything running on a PC has to face lots of random CPU time being stolen. Even in a console environment you have to deal with the other threads taking variable amounts of time. For example if you have 6 cores, you want 6 worker threads. But you might also have 3 other threads, like a main thread, a gpu-feeding thread, and a sound thread. The main thread might usually take 90% of the cpu, so the worker on that core rarely gets any time; the gpu-feeder might usually take 50% of the cpu time on its core, but in phases - like it takes 100% of the cpu for half the frame then goes idle for the other half. With work stealing your worker thread will automatically kick in and use that other time.

In order for the self balancing to work as well as possible you need small worklets. In fact, the possible wasted time is equal to the duration of the longest task. The time waste case happens like this :

Fire N work items, each taking time T, to N cores
For some reason one of the cores is busy with something else, so only (N-1) of them do work
Now you block on the work being done and have to wait for that busy core, so the total time is 2T

Total time should have been N*T/(N-1)
The waste is 2T - N*T/(N-1) = T*(N-2)/(N-1), which approaches T for N large
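Plugging numbers into the derivation above makes the bound concrete:

```python
# Waste from the scenario above: actual time is 2*T (one worker runs two
# items back to back), while the ideal spreads N items over the N-1
# available cores.

def waste(N, T=1.0):
    actual = 2 * T
    ideal = N * T / (N - 1)
    return actual - ideal           # = T * (N-2)/(N-1), -> T as N grows

assert abs(waste(8) - 6.0 / 7.0) < 1e-12      # ~0.86*T wasted at N=8
assert abs(waste(1000) - 998.0 / 999.0) < 1e-12  # nearly a full T at N=1000
```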

Basically the smaller T is (the duration of longest work) the more granular the stealing self-allocation is. Another easy way to see it is :

You are running N workers and a main thread
The main thread takes most of the time on core 0
You fire (N-1) very tiny work items and the (N-1) non-main cores pick them up
You fire a very large work item and core 0 worker picks it up

That's a disaster. The way to avoid it is to never fire single large work items - if you had split that into lots of little work items it would have self-balanced, because the stealing nature means that only worker threads that have time take tasks.

For example, with something like a DXTC encoder, rather than split the image into something like N rectangles for N cores and fire off 1 work item to each core, you should go ahead and split it into lots of tiny blocks. Of course this requires that the per-work-item overhead is extremely low, which of course it is because we are all lock-free and goodness.

There are some things I haven't done yet that I might. One is to be a bit smarter about the initial work dispatching, try to assign to the CPU that will have the most idle time. If you actually did make nice tiny worklets all the time, that wouldn't be an issue, but that's not always possible. In the case that you do make large work items, you want those to go to the cores that aren't being used by other threads in your game.

Another issue is the balance of throughput vs latency. That is, how fast does the system retire work, vs. how long does it take any individual work item to get through. Currently everything is optimized for throughput. Work is done in a roughly FIFO order, but with dependencies it's not guaranteed to be FIFO, and with the work stealing and variations in CPU time assignment you can have individual work items that take a lot longer to get through the system than is strictly necessary. Usually this isn't a big deal, but sometimes you fire a Worklet and you need it to get done as quickly as possible. Or you might need it done inside a certain deadline, such as before the end of the frame. For example you might fire a bunch of audio decompression, but set a deadline to ensure it's done before the audio buffers run out of decompressed data. Handling stuff like that in a forward-dispatched system is pretty easy, but in work-stealing it's not so obvious.

Another similar issue is when the main thread decides to block on a given work item. You want that item to get done as soon as possible by the thread that has the highest probability of getting a lot of CPU time. Again not easy with work-stealing since some worker thread may have that item in its queue but not be getting much CPU time for some reason.


won3d said...

So how is the dependency graph represented in the library, and how is it expressed in client code?

cbloom said...

Each Worklet has an Id, and each Worklet can depend on a list of other Ids. Once a Worklet is fired you can't change its dependencies.

This creates a tree structure of dependencies. You know you can always run the leaves.

In practice I don't actually build a tree and there are some little clever bits to make dependency checking very fast.

It does still have overhead because you need an Id <-> Worklet map. Because of that there's also a fast path for Worklets that don't depend on anyone and that no one else can depend on, so they just get fired and don't participate in the map.
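One common way to make this kind of dependency checking cheap (a guess at the flavor of those "clever bits", not the actual Oodle implementation) is a remaining-dependency count per Id:

```python
# Sketch of Id-based dependency tracking: each Id keeps a count of
# unfinished dependencies, and finishing a worklet decrements the counts
# of its dependents. A count hitting zero means "ready to run".

remaining = {}     # Id -> number of unfinished dependencies
dependents = {}    # Id -> list of Ids waiting on it

def add(id, deps):
    remaining[id] = len(deps)
    for d in deps:
        dependents.setdefault(d, []).append(id)

def complete(id):
    ready = []
    for child in dependents.get(id, []):
        remaining[child] -= 1
        if remaining[child] == 0:
            ready.append(child)         # can now be pushed to a worker queue
    return ready

# the A, B, C example: C depends on both A and B
add('A', []); add('B', []); add('C', ['A', 'B'])
assert complete('A') == []      # C still waits on B
assert complete('B') == ['C']   # last dependency done: C becomes ready
```

This makes "reevaluate dependencies" O(number of dependents) per completed item, with no tree walk.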

John W. Ratcliff said...


Great post. Is this something you are planning on open sourcing? I have been waiting to add such features to my job swarm until I need them and, unfortunately, I haven't been writing any code lately that needs it.

The batching option is something that is a really common use case and I will provide it the next time I do a source update.


cbloom said...

No, this is probably not going in the open source. This is work for RAD that will be in my product Oodle, and a lot of the black art of making it really nice is tweaking the details of dealing with threads and scheduling on all the different platforms and operating systems.

For example for SPUs the internal function will be very different; it will probably just be forward-dispatched, not work-stealing.

I did open source the lock-free primitives that I did for this, which should make it pretty easy to roll your own.

cbloom said...

BTW work-stealing self balancing is also great if your worklets take variable amounts of time that are hard to predict.

Consider for example an SPU or LRB environment where you have workers that are 100% under your control so you don't have to worry about them losing CPU time to other threads.

Say you want to fire 80 worklets to 8 workers. A forward dispatcher might just push 10 items onto each worker. That's fine if you know that all the work items take the same time, but if they take variable amounts of time it can be arbitrarily bad.

For example if one item takes time 10*T , and the other 79 all take time T, your total time will be 19*T when it only needs to be 11.1*T

Obviously to really do a forward dispatcher right you would have to have a dispatcher thread running all the time and watching the workers and redistributing the work to rebalance things. Basically you wind up writing a whole scheduler like an OS which is a pretty scary piece of code to write. Work stealing basically does a decent job of this for you for free.
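The numbers above are easy to check with a toy simulation, using greedy "next free worker takes the next item" scheduling as a stand-in for what stealing achieves (with discrete unit-sized items the best achievable makespan is 12*T, one unit over the fluid ideal of 89/8 = 11.125*T):

```python
import heapq

# Toy check of the numbers above: 80 items on 8 workers, one item costs
# 10*T and the other 79 cost T each. Compare static up-front dispatch
# against greedy "whichever worker frees up first takes the next item".

times = [10] + [1] * 79     # in units of T

def static_split(times, workers=8):
    # forward dispatcher: 10 items pushed to each worker up front
    loads = [sum(times[i::workers]) for i in range(workers)]
    return max(loads)

def greedy(times, workers=8):
    # each item goes to whichever worker frees up first (min-heap of loads)
    heap = [0] * workers
    for t in times:
        heapq.heapreplace(heap, heap[0] + t)
    return max(heap)

print(static_split(times))  # 19: the worker with the big item also got 9 small ones
print(greedy(times))        # 12: optimal for these discrete items
```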

Assen said...

I hope you get email about thread comment necromancy...

What do you do once a thread is idle, and it can't find anything to steal? Basically the system has emptied itself of worklets - how do you sleep your worker threads so they wake efficiently when there's work to do?

I can't think of anything clever.
You can put them all to sleep on their own events, and signal each event from the main thread when you add the first worklet onto a worker thread's currently empty queue; but then you have to be careful to spread work, eg. to assign it in round-robin order. This feels a bit wrong to me - the beauty of work-stealing should allow me to dump all the work on the first worker and let it spread automatically.

cbloom said...

By far the simplest and most elegant way is just to use a counting semaphore.

Each time you push a work item, "Post" (inc) the semaphore. Each time you pop a work item, "Wait" (dec) the semaphore.

Then if there is no work, the dec is a sleep. When you push new work, the right number of guys get woken.

In practice you want to use something like "fastsemaphore" that I posted previously so that only the inc/decs which cause thread wake/sleeps are kernel calls, and other cases are just a lockfree inc/dec.

As long as your workers only ever sleep due to the work queue being empty, this is a great and very clean solution.
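A minimal sketch of that scheme, using Python's stock threading.Semaphore in place of fastsemaphore:

```python
import threading
from collections import deque

# Counting-semaphore work queue: every push posts (increments) the
# semaphore, every pop waits (decrements) it, so idle workers sleep in
# the wait and exactly the right number wake when work arrives. The
# "fastsemaphore" mentioned above avoids kernel calls when no wake/sleep
# is needed; a plain semaphore is used here for brevity.

class WorkQueue:
    def __init__(self):
        self.sem = threading.Semaphore(0)
        self.lock = threading.Lock()
        self.items = deque()

    def push(self, item):
        with self.lock:
            self.items.append(item)
        self.sem.release()              # "Post": wakes one sleeping worker

    def pop(self):
        self.sem.acquire()              # "Wait": sleeps if the queue is empty
        with self.lock:
            return self.items.popleft()

q = WorkQueue()
results = []

def worker():
    while True:
        item = q.pop()                  # sleeps here when there's no work
        if item is None:                # None as a shutdown signal
            return
        results.append(item * 2)

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    q.push(i)
q.push(None)
t.join()
```

Note the shutdown signal is itself just a pushed item, so it rides the same semaphore and wakes a sleeping worker cleanly.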

In my implementation, the worker threads can sleep for other reasons, so I have a more complicated solution.

Also I believe I posted somewhere about the fact that you may want to spin a bit when your fastsemaphore detects that it is waking or sleeping a thread, because if the action can get handled very soon without a thread transition that is preferable. Obviously this depends on usage pattern, etc.
