
07-13-11 - Good threading design for games

I want to back up and talk a bit about how threading should be used in modern games and what makes good or bad threading design in general.

Games are sort of a weird intermediate zone. We aren't quite what would be considered "real-time" apps (a true "real-time" app sets itself to the maximum possible priority and basically never allows any thread switches), nor are we quite "high performance computing" (which emphasizes throughput over latency, and also takes over the whole machine), but we also aren't just normal GUI apps (that can tolerate long losses of cpu time). We sort of have to compromise.

Let me lay out a few principles of what I believe is good threading for games, and why :

1. Make threads go to sleep when they have nothing to do (eg. don't spin). A true real-time or high-performance app might have worker threads that just spin on popping work items off a queue. That gives them guaranteed few-hundred-clock latencies. It is not appropriate for games. You are pretty much always running on a system that you need to share. Obviously on the PC there may be other apps running, but even on consoles there are system threads that you have to share with and don't have full control over. The best time to give the system that time is when your threads are idle, so make your threads really go to sleep when they are idle.
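As a minimal sketch of that (using portable C++11 primitives rather than the Win32 calls the post assumes; the names here are invented for illustration), a worker that really sleeps blocks in a condition-variable wait instead of spinning on the queue:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Sketch: the worker blocks in cv.wait() until work arrives or shutdown
// is requested, so an idle worker consumes no cpu at all.
struct WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> q;
    bool quit = false;

    void push(std::function<void()> fn) {
        { std::lock_guard<std::mutex> lock(m); q.push(std::move(fn)); }
        cv.notify_one(); // wake the worker only when there is actually work
    }
    void worker() {
        for (;;) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this]{ return quit || !q.empty(); }); // sleeps, no spin
                if (q.empty()) return; // quit requested and queue drained
                fn = std::move(q.front()); q.pop();
            }
            fn();
        }
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lock(m); quit = true; }
        cv.notify_all();
    }
};
```

On Win32 the same shape would use a semaphore or event instead of the condition variable; the point is the same, the idle worker is truly asleep in an OS wait.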

2. Never use Sleep(n). Not ever. Not even once. Not even in your spin-backoff loops. In the literature you will often see people talk of exponential backoffs and such. Not for games.

The issue is that even a Sleep(1) is 1 milli. 1 milli is *forever* in a game. If the sleep is on the critical path to getting a frame out, you can add 1 milli to your frame time. If you hit another sleep along the critical path you have another millisecond. It's no good.

3. All Waits should be on events/semaphores/what-have-you (OS waitable handles), so that sleeping threads are woken up when they can proceed, not on a timer.

4. Thread switches should be your #1 concern. (other than deadlocks and races and such things that are just bugs). Taking a mutex is 100 clocks or so. Switching threads is 10,000 clocks or so. It's very very very bad. The next couple of points are related to this :

5. "Spurious" or "polling" wakeups are performance disasters. If a thread wakes up, checks a condition, and goes back to sleep, you just completely wasted 10,000 clocks. Ensure that threads wait on a specific condition and are only woken up when that condition is true. This can be a bit tricky when you need to wait on multiple conditions (and don't have Win32's nice WaitForMultiple to help you).
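Without WaitForMultiple, one portable way to wait on several conditions (a sketch, not from the post; names are invented) is to fold them all into a single predicate under one lock, so the thread only wakes when at least one of them is actually true:

```cpp
#include <condition_variable>
#include <mutex>

// A thread that needs "IO finished OR frame requested" waits on one
// condition variable whose predicate checks both flags; it never wakes
// just to discover that neither is set.
struct MultiWait {
    std::mutex m;
    std::condition_variable cv;
    bool io_done = false;
    bool frame_requested = false;

    // returns 1 if IO finished, 2 if a frame was requested
    int wait_for_any() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this]{ return io_done || frame_requested; });
        return io_done ? 1 : 2;
    }
    void signal_io_done() {
        { std::lock_guard<std::mutex> lock(m); io_done = true; }
        cv.notify_all();
    }
    void signal_frame_requested() {
        { std::lock_guard<std::mutex> lock(m); frame_requested = true; }
        cv.notify_all();
    }
};
```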

6. You should sacrifice some micro-efficiency to reduce thread thrashing. For example, consider the blocking SPSC queue using a semaphore that we talked about recently. If you are using it to push work items that take very little time, you can get the following pattern :

  main thread pushes work
  worker wakes up , pops work, does it, sees nothing available, goes back to sleep
  main thread pushes work
  worker wakes up , pops work, does it, sees nothing available, goes back to sleep
That's a thread switch per item - big disaster. There are a few solutions. The best in this case is to detect that the work item is very trivial and just do it immediately on the main thread rather than pushing it to a worker. Another is to batch up a packet of work items and post the semaphore once for the whole batch instead of once per item. Another is to play with priorities - bump up your priority while you are making work items, and then bump it back down when you want them all to fire.

Basically, doing some extra work and checks that reduce your peak throughput but avoid the bad cases is the way to go.
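A sketch of the first solution (the names and the cost threshold here are invented for illustration; real code would calibrate the threshold by profiling): estimate each item's cost and run trivial items on the pushing thread, paying the semaphore post and wakeup only for work that is worth shipping to a worker:

```cpp
#include <functional>
#include <queue>

// Hypothetical work item: the cost estimate is supplied by the caller.
struct Task {
    std::function<void()> run;
    int estimated_clocks;
};

struct Submitter {
    static const int kInlineThreshold = 10000; // ~cost of a thread switch
    std::queue<Task> worker_queue; // stands in for the semaphore-guarded queue
    int ran_inline = 0, shipped = 0;

    void submit(Task t) {
        if (t.estimated_clocks < kInlineThreshold) {
            t.run();        // cheaper to just do it than to wake a worker
            ++ran_inline;
        } else {
            worker_queue.push(std::move(t)); // worth a semaphore post + wakeup
            ++shipped;
        }
    }
};
```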

7. Mutexes can cause bad thread thrashing. The problem with mutexes (critical sections) is not that they are "slow" (oh no, 100-200 clocks big whoop) in the un-contended case, it's what they do under contention. They block the thread and swap it out. That's fine if that's actually what you want to do, but usually it's not.

Consider for example a queue that's mutex protected. There are a bunch of bad articles about the performance of mutex-protected queues vs. lock-free queues. That completely misses the point. The problem with a mutex-protected queue is that when you try to push an item and hit contention, your thread swaps out. Then maybe the popper gets an item and your thread swaps back in. Thread swapping in & out just to push an item is terrible. (obviously mutexes that spin before swapping try to prevent this, but they don't prevent the worst case, which means you can have unpredictable performance, and in games the most important thing is generally the worst case - you want to minimize your maximum frame time)

Basically, mutexing is not a good reason to swap threads. You should swap threads when you are out of work to do, not just to let another thread get access to a contended shared variable so that you can run some more when it's done.

8. Over-subscription is bad. Over-subscription (having more threads than cores) inherently creates thread switching. This is okay if the threads are waiting on non-computational signals (eg. IO or network), but otherwise it's unnecessary. Some people think it's elegant to make a thread per operation type, eg. a ray cast thread, a collision thread, a particle thread, etc. but in reality all that does is create thread switch costs that you don't need to have.

In video games, you basically don't want the time-slicing aspect of threads to ever happen (in the absence of other apps). You want a thread to run until it has to wait for something. Over-subscribing will just cause time-slice thread switching that you don't need. You don't want your render thread and your physics thread to time-slice and swap back and forth on one core. If they need to share a core, you want all your physics work to run, then all your render work to run, sequentially.

9. Generally a thread per task is bad. Prefer a thread per core. Make a task system that can run on any thread. You can then manually sequence tasks on the cores by assigning them to threads. So in the example above instead have a physics task and a render task, which might be assigned to different cores, or might be assigned to one thread on one core and run sequentially.
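A minimal sketch of that shape (portable C++11; the class and its details are invented for illustration): a fixed set of generic worker threads, one per core, all pulling tasks from a shared queue, so "physics" and "render" are just tasks that land wherever a core is free:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One worker thread per core; tasks are plain closures, so any task can
// run on any thread, and two tasks queued together run sequentially if
// they end up sharing a core.
class TaskPool {
public:
    explicit TaskPool(unsigned nthreads) {
        for (unsigned i = 0; i < nthreads; ++i)
            threads.emplace_back([this]{ worker(); });
    }
    ~TaskPool() {
        { std::lock_guard<std::mutex> lock(m); quit = true; }
        cv.notify_all();
        for (auto &t : threads) t.join();
    }
    void push(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m); q.push(std::move(task)); }
        cv.notify_one();
    }
private:
    void worker() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this]{ return quit || !q.empty(); });
                if (q.empty()) return; // quit requested and queue drained
                task = std::move(q.front()); q.pop();
            }
            task();
        }
    }
    std::vector<std::thread> threads;
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> q;
    bool quit = false;
};
```

Typical use would be `TaskPool pool(std::thread::hardware_concurrency());` and then `pool.push(...)` for each task; a real game scheduler would add per-task dependencies and core affinity on top of this.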

11 comments:

Ash said...

Interesting read Charles. Thanks for sharing. Are you aware of any books / series of articles that shed more light on such practical (as opposed to academic, which I already have) knowledge of threading?

Thanks.

cbloom said...

Not that I know of. There are some good talks by the Killzone guys and maybe some by the Insomniac guys. Though that's mostly about PS3 SPU task stuff, it's still relevant, because the ideal way to thread CPUs for games is a lot like how you run SPUs.

nothings said...

I am not aware of any alternative to calling Sleep() for an audio thread that feeds audio data out to the audio API. At least not if the API doesn't have some way to signal or interrupt you. So I wouldn't say never, but yes, it's very close to never. (Iggy has three sleeps, all involving the audio thread, and two of them shouldn't be in there and I need to set up some sort of signal to avoid it.)

Your point #7 got cut off in the middle.

gerhans said...

Thanks for the post Charles. Would you be willing to expound on strategies for avoiding Sleep() under Windows?

cbloom said...

"I am not aware of any alternative to calling Sleep() for an audio thread that feeds audio data out to the audio API."

Yeah that is a special case. I'm not sure how the really good low-latency audio apps are written. I think some of them actually implement a DPC device for doing audio processing; that way you actually get called by the system like an interrupt when it needs some buffer.

One idea : make one thread that's devoted to filling audio buffers. Make it super high priority and put it on a waitable timer so it pumps at the required interval to fill buffers. Of course a waitable timer is equivalent to a Sleep, but it's more reliable because it gives you a signal on a steady interval. The main thing is to make that thread super-high priority - that way the timer (or sleep) durations actually are what you think they are - and make it do nothing but fill the buffer. It can use a semaphore or some other mechanism to notify other threads when it needs them to give it more audio data.

So the idea is that the one thread that talks to a non-computational resource (the audio hardware) has to use periodic polling, but then it translates its "need buffer" condition into a signallable event that other threads can wait on. So the sleeping part of the system is isolated and has no risk of cascading.
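A portable sketch of that isolation idea (not the Win32 implementation; `sleep_for` stands in for the waitable timer, and all names are invented): one pump thread polls on a period and translates "I need more data" into a signal other threads can wait on, so the periodic sleeping never spreads past this one thread:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// The pump thread does the one allowed periodic wait; producer threads
// sleep on the condition variable, never on a timer.
struct AudioPump {
    std::mutex m;
    std::condition_variable cv;
    int buffers_needed = 0;
    bool done = false;

    void pump_thread(int iterations, std::chrono::milliseconds period) {
        for (int i = 0; i < iterations; ++i) {
            std::this_thread::sleep_for(period); // stand-in for the waitable timer
            { std::lock_guard<std::mutex> lock(m); ++buffers_needed; }
            cv.notify_all(); // wake producers via a signal, not a timer
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
    }
    // producer side: returns true when a buffer is wanted, false at shutdown
    bool wait_for_buffer_request() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this]{ return buffers_needed > 0 || done; });
        if (buffers_needed > 0) { --buffers_needed; return true; }
        return false;
    }
};
```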

cbloom said...

"Thanks for the post Charles. Would you be willing to expound on strategies for avoiding Sleep() under Windows?"

Win32 is actually a great API for avoiding sleep.

Maybe I'll do a longer example and make it a post.

ryg said...

"Yeah that is a special case. I'm not sure how the really good low-latency audio apps are written."
The really low-latency stuff uses either ASIO or the Kernel Streaming interface (direct pipe into the low-level audio mixer right in front of the driver).

Not sure about KS, but ASIO can actually give you callbacks directly from the interrupt DPC (regular ASIO is exclusive for that reason). And then there's ASIO4ALL, which implements ASIO on top of KS.

cbloom said...

I once spent some time trying to figure out how to get the digital bits of an audio file played directly out of my digital audio out port without Windows fucking anything up with volume multipliers or whatever nonsense. It can be done with foobar2000 and Kernel Streaming, but it's a pain in the ass so I gave up. Maybe I'll try to revive that some day because that's how I want my music to play.

nothings said...

ASIO is pretty nice for musicians, because it locks the record and playback buffers to each other in a really nice way for making them lockstep and minimizing latency.

For just audio playback, it seems like overkill.

Josh Greifer said...

Thanks for the post, Charles and for @visualc for tweeting.

On point 2, I certainly agree about Sleep(n), but there is a very useful, even vital, variant, which is

SleepEx(0);

The above call lets the system process any pending queued asynchronous procedures, and handle pending IO callbacks, *without* waiting.

Under windows, we have direct OS support for the "proactor" design pattern, which in my experience leads to the fastest, and in some ways simplest, design. In this pattern, the application runs in a single user thread, yet never waits for anything, but proceeds by spinning off all time-consuming tasks and I/O requests to OS-managed threads (which will usually be hyperthreads).
A definition of "time-consuming" on a modern system could be as little as 10 nanoseconds. When building apps that use the proactor pattern, you can either use the boost::asio library (no relation to the ASIO developed by Steinberg), or, under Windows, you can call the OS directly, using APCs and the calls that end in "Ex":

QueueUserAPC(), WaitForMultipleObjectsEx, SleepEx, ReadFileEx, etc.

On point 6, I use macros which allow me at design-time to easily switch between spinning off a task into an APC or running it in the main thread. I make the final choice after profiling the application.

On point 7, you seem to imply that mutexes and critical sections are the same thing, which is quite wrong! Mutexes, as you say, are slow heavyweight systemwide objects which carry a lot of bookkeeping overhead, but are vital (and efficient) for interprocess synchronization. But critical sections are relatively lightweight and easy (but unnecessary) to implement oneself. Basically all you need to roll your own critical section is one "Bit Test and Set" machine code instruction in a tight loop.

Semaphores are useful constructs, particularly when you use the proactor pattern. If you need to process five audio buffers, for example, raising a semaphore with a count of five will instruct the OS to, in effect, queue 5 calls to a buffer handling routine, rather than having to raise a semaphore once for each buffer.

All in all, my advice on getting blinding performance is to delegate as much of the threading logic as possible to the OS (and to the GPU where available). That said, it's also very important to try to code all this logic by hand, if only to get a deep knowledge of how the OS does it.

cbloom said...

Hey Josh, a couple of points -

A. In general I agree that APC's in Windows are a nice system. I wrote an earlier post about how lots of async work is better done as a little callback on an already running thread (rather than sending a message over to another thread). Windows as of Vista+ also has a very nice generalized worker thread pool system for doing APC's.

While that's all very nice, it's not portable.

I would be a little worried about using SleepEx as a way to pump APC's, but maybe it's okay. It seems like there might be unexpected side effects to that, and I wonder if there is a more obviously side-effect-free way to pump APC's.

Anyway, I like the APC way of writing code a lot, and to some extent Oodle has become a cross-platform way of doing Windows-like APC's. (eg. things like "run this snippet of code after this IO finishes")

B. As for CriticalSections, I disagree.

It's crucial to distinguish between the Windows specifics and the general concepts.

Generally if I say "mutex" I mean it in the algorithmic sense, as in "something that provides mutual exclusion".

In that sense, a CriticalSection is a form of mutex.

You say a few things about CriticalSection that are just not true :

"But critical sections are relatively lightweight and easy (but unnecessary) to implement oneself."

CS is most definitely NOT easy to implement yourself. It uses a bit-test-and-set spin loop, but then automatically allocates an OS event and will wait on that to actually put the thread to sleep. (see for example my "Event mutex that makes event on demand" in an earlier post).

Windows CS also has deadlock debugging support, recursion, priority inversion protection, etc. It's quite complex actually.

"Basically all you need to roll your own critical section is one "Bit Test and Set" machine code instruction in a tight loop."

Not true, I'm not sure if you're simplifying or misunderstanding CS, but in fact CS becomes just like a kernel "Mutex" when there is contention. It simply has a fast path for the no-contention case. This is similar to fast mutexes on most platforms these days.

And it is because of this fact that CS is not actually any better about thread blocking than "Mutex" under windows. Yes, CS has a much faster best case, but it has the exact same worst case, which is threads going to sleep and waking up.

For example a "lock convoy" with CS's is just as bad as a lock convoy with Mutexes.
