Games are sort of a weird intermediate zone. We aren't quite what would be considered "real-time" apps (a true "real-time" app sets itself to the maximum possible priority and basically never allows any thread switches), nor are we quite "high performance computing" (which emphasizes throughput over latency, and also takes over the whole machine), but we also aren't just normal GUI apps (that can tolerate long losses of cpu time). We sort of have to compromise.
Let me lay out a few principals of what I believe is good threading for games, and why :
1. Make threads go to sleep when they have nothing to do (eg. don't spin). A true real-time or high-performance app might have worker threads that just spin on popping work items off a queue. This allows them to have few-hundred-clock latencies gauranteed. This is not appropriate for games. You are pretty much always running on a system that you need to share. Obviously on the PC they may have other apps running, but even on consoles there are system threads that you need to share with and don't have full control over. The best time to do this is when your threads are idle, so make your threads really go to sleep when they are idle.
2. Never use Sleep(n). Not ever. Not even once. Not even in your spin-backoff loops. In literature you will often see people talk of exponential backoffs and such. Not for games.
The issue is that even a Sleep(1) is 1 milli. 1 milli is *forever* in a game. If the sleep is on the critical path to getting a frame out, you can add 1 milli to your frame time. If you hit another sleep along the critical path you have another millisecond. It's no good.
3. All Waits should be on events/sempahores/what have you (OS waitable handles) so that sleeping threads are woken up when they can proceed, not on a timer.
4. Thread switches should be your #1 concern. (other than deadlocks and races and such things that are just bugs). Taking a mutex is 100 clocks or so. Switching threads is 10,000 clocks or so. It's very very very bad. The next couple of points are related to this :
5. "Spurious" or "polling" wakeups are performance disasters. If a thread has to wake up, check a condition, and then goes back to sleep, you just completely wasted 10,000 clocks. Ensure that threads are waiting on a specific condition, and they are only woken up when that condition is true. This can be a bit tricky when you need to wait on multiple conditions (and don't have Win32's nice WaitForMultiple to help you).
6. You should sacrifice some micro-efficiency to reduce thread thrashing. For example, consider the blocking SPCS queue using semaphore that we talked about recently. If you are using that to push work items that take very little time, you can have the following pattern :
main thread pushes work worker wakes up , pops work, does it, sees nothing available, goes back to sleep main thread pushes work worker wakes up , pops work, does it, sees nothing available, goes back to sleepconstantly switching. Big disaster. There are a few solutions. The best in this case would be to detect that the work item was very trivial and just do it immediately on the main thread rather than pushing it to a worker. Another would be to batch up a packet of work and send all of them at once instead of posting the semaphore each time. Another is to play with priorities - bump up your priority while you are making work items, and then bump it back down when you want them all to fire.
Basically doing some more work and checks that reduce your max throughput but avoids some bad cases is the way to go.
7. Mutexes can cause bad thread thrashing. The problem with mutexes (critical sections) is not that they are "slow" (oh no, 100-200 clocks big whoop) in the un-contended case, it's what they do under contention. They block the thread and swap it out. That's fine if that's actually what you want to do, but usually it's not.
Consider for example a queue that's mutex protected. There are a bunch of bad articles about the performance of mutex-protected queues vs. lock-free queues. That completely misses the point. The problem with a mutex-protected queue is that when you try to push an item and hit contention, your thread swaps out. Then maybe the popper gets an item and your thread swaps back in. Thread swapping in & out just to push an item is terrible. (obviously mutexes that spin before swapping try to prevent this, but they don't prevent the worst case, which means you can have unpredictable performance, and in games the most important thing is generally the worst case - you want to minimize your maximum frame time)
Basically, mutexing is not a good reason to swap threads. You should swap threads when you are out of work to do, not just to let someone get access to a blocked shared variable, and then you'll run some more when they're done.
8. Over-subscription is bad. Over-subscription (having more threads than cores) inherently creates thread switching. This is okay if the threads are waiting on non-computational signals (eg. IO or network), but otherwise it's unnecessary. Some people think it's elegant to make a thread per operation type, eg. a ray cast thread, a collision thread, a particle thread, etc. but in reality all that does is create thread switch costs that you don't need to have.
In video games, you basically don't want the time-slicing aspect of threads to ever happen (in the absense of other apps). You want a thread to run until it has to wait for something. Over-subscribing will just cause time-slice thread switching that you don't need. You don't want your render thread and your physics thread to time-slice and swap back and forth on one core. If they need to share a core, you want all your physics work to run, then all your render work to run, sequentially.
9. Generally a thread per task is bad. Prefer a thread per core. Make a task system that can run on any thread. You can then manually sequence tasks on the cores by assigning them to threads. So in the example above instead have a physics task and a render task, which might be assigned to different cores, or might be assigned to one thread on one core and run sequentially.