9/12/2010

09-12-10 - The deficiency of Windows' multi-processor scheduler

Windows' scheduler is generally pretty good, but there appears to be a bit of shittiness with its behavior on multicore systems. I'm going to go over everything I know about it and then we'll hit the bad point at the end.

Windows, as I'm sure you know, is a strictly exclusive high-priority scheduler. That means if there are higher priority threads that can run, lower priority threads will get ZERO cpu time, not just less cpu time. (* this is not quite true because of the "balanced set manager" ; see later).

Windows' scheduler is generally "fair", and most of its lists are FIFO. The scheduler is preemptive and runs immediately upon events that might change the scheduling conditions; there's no delay and no master scheduler routine that has to get invoked. For example, when you call SetThreadPriority() , if that affects scheduling, the change happens immediately (eg. if you make the current thread lower priority than some other thread that can run, you will stop running right then, not at the end of your quantum). Changing processor affinity, Waits, etc. all cause an immediate reschedule. Waits, Locks, etc. always put your thread onto a FIFO list.
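To make the immediacy concrete, here's a tiny sketch (mine, not from any SDK sample) - the SetThreadPriority call itself is the reschedule point :

    #include <windows.h>

    // Sketch : demoting yourself takes effect inside the call, not at the
    // end of your quantum. If a ready thread now outranks us, the context
    // switch happens before SetThreadPriority even returns.
    void demote_self(void)
    {
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL );
        // by the time we get here, any higher priority ready thread has
        // already had its turn
    }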

The core of the scheduler is a list of threads for each priority, on each processor. There are 32 priorities, 16 fixed "realtime" priorities, and 16 "dynamic" priorities (for normal apps). Priority boosts (see later) only affect the dynamic priority group.

When there are no higher priority threads that can run, your thread will run for a quantum (or a few, depending on quantum boosts and other things). When your quantum is done you might get switched for another thread of the same priority. The quantum-expired decision is now based on CPU cycles actually consumed (read via the TSC), not the system timer interrupt (that change was made in Vista or sumfin), so if your cycles are being stolen you might run for more seconds than a "quantum" might indicate.
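(You can watch that cycle accounting yourself; QueryThreadCycleTime is the Vista+ API that exposes it. A little sketch, with do_some_work as a stand-in for your real code :)

    #include <windows.h>
    #include <stdio.h>

    extern void do_some_work(void); // stand-in for your actual workload

    // Sketch : sample the kernel's per-thread cycle counter around a work
    // item; this counts cycles the thread actually consumed, independent of
    // how much wall-clock time passed while other threads stole the CPU.
    void report_cycles(void)
    {
        ULONG64 c0, c1;
        QueryThreadCycleTime( GetCurrentThread(), &c0 );
        do_some_work();
        QueryThreadCycleTime( GetCurrentThread(), &c1 );
        printf( "thread consumed %llu cycles\n", (unsigned long long)(c1 - c0) );
    }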

The default quantum is an ungodly 10-15 millis. But amusingly enough, there's almost always some app on your system that has done timeBeginPeriod(1), which globally sets the system quantum down to 1 milli, so I find that I almost always see my system quantum at 1 milli. (It's a pretty gross thing BTW that apps can change the scheduler like that, but without it the quantum is crazy long.) (Really, instead of doing timeBeginPeriod(1), people should have made the threads that need more responsiveness very high priority, and had them go to sleep when not getting input or events.)
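(If you do have to touch it, at least pair the calls. A sketch, with the worker body hypothetical :)

    #include <windows.h>
    #include <mmsystem.h>
    #pragma comment(lib, "winmm.lib")

    extern void do_latency_sensitive_work(void); // hypothetical

    // Sketch : timeBeginPeriod is a global, refcounted request; every call
    // must be matched by a timeEndPeriod with the same value, or the whole
    // system stays stuck at the finer (more power-hungry) period.
    void run_latency_sensitive_part(void)
    {
        timeBeginPeriod( 1 );
        do_latency_sensitive_work();
        timeEndPeriod( 1 );
    }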

So that's the basics, and it all sounds okay. But then Windows has a mess of hacks designed to fix problems. Basically these are problems with people writing shitty application code that doesn't thread right, and Windows is covering everyone's mistakes - and really it does a hell of a good job with it. The tools it uses are temporary priority boosts and temporary quantum boosts.

Threads in the foreground process always get a longer quantum (something like 60 millis on machines with default quantum length!). Priority boosts affect a thread temporarily: each quantum the priority boost wears off one step, and the thread gets extra quanta until the boost is gone. So a boost of +8 gives you 8 extra quanta to run (and you run at higher priority). You get priority boosts for :

 IO completion (+variable)

 wakeup from Wait (+1)
   the special boost for the foreground process is not disableable

 priority inversion (boosts the owner of a locked resource that another thread is waiting on)
   this is done periodically in the deadlock check
   boosts to 14

 GUI thread wakeup (+2) eg. from WaitMsg

 CPU starvation ("balanced set manager")
    temporary kick to 15

the boost for IO completion is :
    disk : +1
    network : +2
    key/mouse input : +6
    sound : +8
(the boost amount comes from the driver at completion time)

Look at, for example, the GUI thread wakeup boost. What really should be happening there is that you make a GUI message processing thread that is super minimal and very high priority. It should just do WaitMsg to get GUI input and then respond to it as quickly as possible, maybe queue some processing on a normal priority thread, then go back to sleep. The priority boost mechanism is basically emulating this for you.
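(A sketch of that pattern - my hedged reconstruction, not code from anywhere; the actual API is WaitMessage, and the hand-off to the worker is left abstract :)

    #include <windows.h>

    extern void queue_work_for_normal_thread(const MSG * msg); // hypothetical hand-off

    // Sketch : a tiny, very high priority GUI thread that sleeps in the
    // message wait, drains input quickly, pushes real work to a normal
    // priority thread, and goes right back to sleep.
    DWORD WINAPI gui_thread(LPVOID unused)
    {
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
        for(;;)
        {
            WaitMessage(); // sleep until input arrives
            MSG msg;
            while ( PeekMessage( &msg, NULL, 0, 0, PM_REMOVE ) )
            {
                queue_work_for_normal_thread( &msg ); // keep this thread lean
            }
        }
    }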

A particularly nasty example is the priority inversion boost. When a low priority thread is holding a mutex and a high priority thread tries to lock it, the high priority thread goes to sleep; but the low priority thread might never run if there are medium priority threads that want to run, so the high priority thread would be stuck forever. To fix this, Windows checks for this case in its deadlock checker. All of the "INFINITE" waits in Windows are not actually infinite - they wait for 5 seconds or so (the delay is settable in the registry), after which time Windows checks them for being stalled out and might give you a "process deadlocked" warning; in this check it looks for the priority inversion case, and if it sees it, it gives the low priority thread the big boost to 14. This has the weird side effect of making the low priority thread suddenly get a bunch of quanta and high priority.
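(For concreteness, a sketch of the inversion setup itself - hypothetical repro code, showing the scenario, not how the kernel implements the check :)

    #include <windows.h>

    extern void slow_work(void); // hypothetical; runs while holding the lock

    static CRITICAL_SECTION g_cs; // InitializeCriticalSection(&g_cs) before starting threads

    // "low" takes the lock at idle priority ...
    DWORD WINAPI low_thread(LPVOID p)
    {
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_LOWEST );
        EnterCriticalSection( &g_cs );
        slow_work(); // with medium priority threads hogging the CPU, we may never run
        LeaveCriticalSection( &g_cs );
        return 0;
    }

    // ... and "high" blocks on it; it now silently depends on "low" getting CPU time.
    DWORD WINAPI high_thread(LPVOID p)
    {
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
        EnterCriticalSection( &g_cs ); // stuck until the boost lets "low" finish
        LeaveCriticalSection( &g_cs );
        return 0;
    }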

The other weird case is the "balanced set manager". This thing is really outside of the normal scheduler; it sort of sits on the side and checks every so often for threads that aren't running at all, and gives them a temporary kick to 15. This kick is different from the others in that it doesn't decay (if it decayed like the others, a kick to 15 would mean ~15 quanta, which is a lot); the thread just runs a few quanta at 15 and then goes back to its normal priority.

You can use SetThreadPriorityBoost to disable some of these (such as the IO completion boost) but not all (the special foreground process stuff for example is not disabled by this, and probably not the balanced set or priority inversion stuff either, is my guess).
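(Sketch of the knob, for reference; GetThreadPriorityBoost reads the current setting back :)

    #include <windows.h>

    // Sketch : opt the current thread out of the disableable dynamic boosts.
    // Per the above, the foreground quantum stuff and (I'd guess) the
    // balanced set / inversion kicks are NOT covered by this.
    void disable_my_boosts(void)
    {
        SetThreadPriorityBoost( GetCurrentThread(), TRUE ); // TRUE = disable boosting

        BOOL disabled;
        GetThreadPriorityBoost( GetCurrentThread(), &disabled ); // verify, if you're paranoid
    }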

I'm mostly okay with this boosting shit, but it does mean that accounting for exactly what priority your thread has and how long you can expect it to run is almost impossible in Windows. Say you make some low-priority worker thread at priority 6 and your main thread at priority 8. Is your main thread actually higher priority than the background worker? Well, no, not necessarily - if the worker got boosted for any of various reasons, it might actually be at higher priority right now.

Okay, that's the summary of the normal scheduler; now let's look at the multi-processor deficiency.

You can understand what's going on if you see what their motivation was. Basically they wanted scheduling on each individual processor to still be as fast as possible. To do this, each processor gets its own scheduling data; that is, each processor has its own 32 priority lists of threads. When a processor has to do some scheduling, it tries to just work on its own lists and schedule onto itself. In particular, simple quantum-expiration thread switches can happen on the local processor without taking a system-wide lock. (* this was changed for Vista/Win7 or sumpin; old versions of the SMP scheduler took a system-wide spinlock to dispatch; see references at bottom).

(mutex/event waits take the system-wide dispatch lock because there may be threads from other processors on the wait lists for those resources)

But generally Windows does not move threads around between processors, and this is what can create the problems. In particular, there is absolutely zero code that tries to distribute the work load evenly. That's up to you, or luck. Even judged purely on priorities, it might not be running the highest priority runnable threads. Let's look at some details :

Each thread has the idea of an "ideal" processor. This is assigned at thread creation by round-robin (with the starting point itself round-robined per process). This is obviously a hacky attempt to balance load, which is sometimes good enough. In particular, if you create a thread per core, this will give you an "ideal" processor on each core, so that's what you want. It also assigns them "right" for hyperthreading, that is, to minimize hyperthread overlap: eg. it will assign core0-hyperthread0, then core1-hyperthread0, then core0-hyperthread1, then core1-hyperthread1. You can also change it manually with SetThreadIdealProcessor , but it seems to me it mostly does the right thing so there's not much need for that. (Note this is different from Affinity, which forces you to run only on those processors; you don't have to run on your ideal proc. But we will see some problems later.)
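(A sketch of the thread-per-core setup, setting the hint explicitly rather than trusting the round-robin; worker_main is hypothetical :)

    #include <windows.h>

    extern DWORD WINAPI worker_main(LPVOID which); // hypothetical worker body

    // Sketch : one worker per logical processor, each with a distinct
    // "ideal" processor but NO affinity mask, so the scheduler prefers
    // that core yet stays free to migrate the thread when it's busy.
    void spawn_workers(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo( &si );
        for ( DWORD i = 0; i < si.dwNumberOfProcessors; i++ )
        {
            HANDLE h = CreateThread( NULL, 0, worker_main, (LPVOID)(UINT_PTR)i, CREATE_SUSPENDED, NULL );
            SetThreadIdealProcessor( h, i ); // a hint, not a restriction
            ResumeThread( h );               // start it after the hint is set
            CloseHandle( h );
        }
    }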

When the scheduler is invoked and wants to run a thread, there's a very big difference between the case where some processors are idle and the case where none are. Let's look at the two cases :

If there are any idle processors : it pretty much does exactly what you would want. It tries to make use of the idle processors, preferring whole idle cores over merely idle hyperthreads. Then from among the set of good choices it will prefer your "ideal", then the one you last ran on, then the one the scheduler is running on right now. Pretty reasonable.

If there are no idle processors : this is the bad stuff. The thread is only scheduled against its "ideal" processor. It just gets stuffed on the "ideal" processor's list of threads and then scheduled on that processor by normal single-proc rules.

This has a lot of bad consequences. Say proc 0 is running something at priority 10, and procs 1-5 are all running something at priority 2. You are priority 6 and you want to run, but your ideal proc happened to be 0. Well, tough shit: you get stuck on proc 0's list and you don't run, even though there's lots of priority 2 work getting done.

This can happen just by bad luck, when you happen to run into another app's threads that have the same ideal proc as you. But it can also happen due to Affinity masks. If you or somebody else has a thread locked to some proc via an Affinity mask, and your "ideal" processor makes the scheduler keep trying to put you there, you will run into that thread over and over.
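(The contrast in code - the affinity mask is a hard restriction, unlike the ideal-processor hint above :)

    #include <windows.h>

    // Sketch : this thread can now ONLY run on processor 0. Anything else
    // camped on proc 0 - yours or another app's - collides with it forever,
    // no matter how idle the rest of the machine is.
    void pin_self_to_core0(void)
    {
        SetThreadAffinityMask( GetCurrentThread(), 1 ); // bit 0 = processor 0
    }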

The other issue is that even when threads are redistributed, it only happens at thread-ready time. Say you have 5 processors that are idle, and on the remaining processor some high priority thread is gobbling the CPU. There can be threads waiting to run on that processor, just sitting there. They will not get moved over to the idle processors until something kicks a scheduling event that tries to run them (eg. the high priority guy goes to sleep, or one of the boosting mechanisms kicks in). This is a transient effect - you should never see long-term scenarios where one processor is overloaded while other processors are idle - but it can last for a few quanta.

References :

Windows Administration : Inside the Windows Vista Kernel, Part 1
Windows Administration : Inside the Windows Vista Kernel, Part 2
Sysinternals Freeware - Information for Windows NT and Windows 2000 - Publications
"Inside the Windows NT Scheduler", Part 1 (access locked)
(a .doc version of "Inside the Windows NT Scheduler" is available, but make sure you read the errata)

Stupid annoying audio-visual formatted information. (god damnit, just use plain text people) :

Mark Russinovich : Inside Windows 7 (Going Deep, Channel 9)
Dave Probert : Inside Windows 7 - User Mode Scheduler (UMS) (Going Deep, Channel 9)
Arun Kishan : Inside Windows 7 - Farewell to the Windows Kernel Dispatcher Lock (Going Deep, Channel 9)

(I haven't actually watched these, so if there's something good in them, please let me know, in text form).

Also "Windows Internals" book, 5th edition.

ADDENDUM : The real interesting question is "can we do anything about it?". In particular, can you detect cases where you are badly load-balanced and kick Windows in some way to adjust things? Obviously you can do some things to load-balance within your own app, but when other apps' threads are taking time from you it becomes trickier.
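Here's one hedged idea of what "kicking" could look like - a pure sketch, totally unproven, all names hypothetical: watch each worker's cycle counter, and if one is starving, move its ideal processor and hope the next ready-time reschedule lands it somewhere less contended.

    #include <windows.h>
    #include <stdlib.h>

    #define STARVED_CYCLES 1000000ULL // hypothetical threshold, tune to taste

    // Sketch : call this periodically from a watchdog thread. workers[] and
    // last_cycles[] track your own worker pool; ncpus from GetSystemInfo.
    void rebalance_check(HANDLE * workers, ULONG64 * last_cycles, int n, DWORD ncpus)
    {
        for ( int i = 0; i < n; i++ )
        {
            ULONG64 now;
            QueryThreadCycleTime( workers[i], &now );
            ULONG64 got = now - last_cycles[i];
            last_cycles[i] = now;
            if ( got < STARVED_CYCLES ) // barely ran since the last check :
                SetThreadIdealProcessor( workers[i], (DWORD)( rand() % ncpus ) );
        }
    }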

12 comments:

  1. We just got context switches rendering in Telemetry - it's amazing how many bad decisions it makes...

    ReplyDelete
  2. There's another thing timeBeginPeriod() does that is far more important in my experience: it affects the resolution of timers and time functions (e.g. GetTickCount() and timeGetTime()).
    These are important fallbacks given the absence of other reliable time sources (RDTSC isn't one, as you probably know). But the default 15 (desktop) or 10 (server) ms resolution is way too coarse. So you often end up calling timeBeginPeriod(1) just to get better time accuracy.

    ReplyDelete
  3. Yeah, that's one of the reasons why you will always see a clock res of 1 milli - somebody has done it to get more precision in their timer.

    It's fucking retarded that the timer function call resolution is tied to the OS scheduler, BTW.

    BTW see also

    http://cbloomrants.blogspot.com/2009/04/04-23-09-telling-time.html

    http://cbloomrants.blogspot.com/2009/03/03-02-09-sleep-sucks-and-vsync-woes.html

    ReplyDelete
  4. I think setting affinity for a single threaded app like that is a *very* bad idea.

    It might improve performance a little bit in some cases, but in other cases it will *cripple* overall machine performance.

    Imagine if everyone was setting cpu affinity to the same core. Basically you've killed any multiprocessing.

    The only time I use affinity is when I am making a bunch of threads myself and I want to control how they distribute onto cores.

    For example in a video game you might put your main game loop on core 0, your threaded physics work on core 1, and your background loader on core 2.

    Even that is not actually ideal, but you can't really do much better without more feedback from the OS which doesn't exist.

    ReplyDelete
  5. The other big case is my Worklet work stealing system where I intentionally make a thread per core and lock it to each core. Then by work stealing it automatically redistributes work to the cores that have available time.

    ReplyDelete
  6. No, the question there was how to demonstrate that thread rescheduling to other cores can hurt performance - with something simpler than a compressor.
    Also, I still don't know why Windows does that even on a system with a single non-idle process.
    And another question was how to lock a thread to the core which was originally allocated to it - without Vista+ APIs.
    Anyway, with that compressor I ended up making it a command-line option.

    ReplyDelete
  7. Yeah I understood what the question was and I'm telling you it's not something you should be doing. It's very bad practice.

    ReplyDelete
  8. So, the problem with work stealing style designs for me is how do you deal with the last thread?

    Like, Bink breaks up the compression on 8x8 blocks - so, if you start 24 threads, they all start working away at a very fine grain.

    Eventually, some other thread on the system runs and blocks, say, thread 17. The rest of my threads pitch in and finish the compression, but thread 17 is still sitting on that *one* little 8x8 block.

    I can tell another thread to go do that block, but the blocked thread is just sitting there waiting to run, and will possibly overwrite some memory when it finally starts back up.

    So, I have to wait for that last thread to end before I can continue processing.

    Super-annoyingly, Windows could just move the thread that interrupted me to another core (since they are all idle), and then everything will finish, but I can't force that to happen.

    Grrrr.

    ReplyDelete
  9. Yeah that's the issue Tim Farrar writes about.

    I think that could be addressed though.

    Duplicating the work and just doing it on the main thread is one option.

    Another is: if the main thread blocks on the async work and there are only a few tasks left, kick those threads to high priority to make sure they get a chance to finish up.

    Also with work stealers the problem can only happen with the very last task on each worker, so it is really greatly diminished.

    ReplyDelete
  10. "Super-annoyingly, Windows could just move the thread that interrupted me to another core (since they are all idle), and then everything will finish, but I can't force that to happen."

    Or it could just move the thread that's blocked to another core! That's what it wants to do, but you've hosed it by setting the thread affinity.

    What happens if you set each thread's ideal processor to its current affinity setting, and don't set affinity? Or what if you set affinity for pairs of threads to pairs of cores, or set each thread to be ideal on core X with affinity to X and X+1?

    ReplyDelete
  11. For my reference, this is Farrar's post on this issue :

    http://farrarfocus.blogspot.com/2010/01/pc-cpu-task-parallelism-limits.html

    ReplyDelete