
05-13-11 - Avoiding Thread Switches

A very common threading model is to have a thread for each type of task. eg. maybe you have a Physics Thread, Ray Cast thread, AI decision thread, Render Thread, an IO thread, Prefetcher thread, etc. Each one services requests to do a specific type of task. This is good for instruction cache (if the threads get big batches of things to work on).

While this is conceptually simple (and can be easier to code if you use TLS, though that's an illusion; it's not actually simpler than fully reentrant code in the long term), if the tasks have dependencies on each other it can create very complex flow with lots of thread switches. eg. thread A does something, thread B waits on that task, when it finishes thread B wakes up and does something, then thread A and C can go, etc. Lots of switching.

"Worklets" or mini work items which have dependencies and a work function pointer can make this a lot better. Basically rather than thread-switching away to do the work that depended on you, you do it immediately on your thread.

I started thinking about this situation :

A very simple IO task goes something like this :


Prefetcher thread :

  issue open file A

IO thread :

  execute open file A

Prefetcher thread :

  get size of file A
  malloc buffer of size
  issue read file A into buffer
  issue close file A

IO thread :

  do read on file A
  do close file A

Prefetcher thread :

  register file A to prefetched list

lots of thread switching back and forth as they finish tasks that the next one is waiting on.
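To make the cost concrete, each of those hand-offs is basically this (a sketch with illustrative names; every step is a sleep on one side and a wake on the other) :

#include <condition_variable>
#include <mutex>

std::mutex g_mtx;
std::condition_variable g_cv;
bool g_ioDone = false;

// Prefetcher side : issue one request, then sleep until the IO thread
// has serviced it.
void IssueAndWait()
{
    std::unique_lock<std::mutex> lk(g_mtx);
    g_ioDone = false;
    // ... push the request onto the IO thread's queue here ...
    g_cv.wait(lk, []{ return g_ioDone; });  // prefetcher goes to sleep
}

// IO thread side : service the request, then wake the prefetcher,
// which is yet another switch.
void ServiceOne()
{
    // ... do the actual open / read / close here ...
    {
        std::lock_guard<std::mutex> lk(g_mtx);
        g_ioDone = true;
    }
    g_cv.notify_one();
}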

The obvious/hacky solution is to create larger IO thread work items, eg. instead of just having "open" and "read" you could make a single operation that does "open, malloc, read, close" to avoid so much thread switching.

But that's really just a band-aid for a general problem. And if you keep doing that, you wind up turning all your different systems into "IO thread work items". (eg. you wind up creating a single work item that's "open, read, decompress, parse animation tree, instantiate character"). Yay, you've reduced the thread switching by ruining task granularity.

The real solution is to be able to run any type of work item on any thread, and to execute it immediately. Instead of putting your thread to sleep and waking up another one that can now do work, you just grab its work and do it yourself. So you might have something like :


Prefetcher thread :

  queue work items to prefetch file A
  work items depend on IO so I can't do anything and go to sleep

IO thread :

  execute open file A

  [check for pending prefetcher work items]
  do work item :

  get size of file A
  malloc buffer of size
  issue read file A into buffer
  issue close file A

  do IO thread work :

  do read on file A
  do close file A

  [check for pending prefetcher work items]
  do work item :

  register file A to prefetched list

so we stay on the IO thread and just pop off prefetcher work items that depended on us and were waiting for us to be able to run.
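As a sketch, the IO thread's main loop under this scheme might be (PopIoRequest, DoIo and PopReadyDependent are hypothetical helpers, not a real API, and Worklet is the little struct sketched earlier) :

struct IoRequest;                               // some queued IO op
IoRequest * PopIoRequest();                     // blocks only when truly idle
void DoIo(IoRequest * req);                     // service it (open/read/close)
Worklet * PopReadyDependent(IoRequest * req);   // next ready dependent, or null

void IoThreadMain()
{
    for (;;)
    {
        IoRequest * req = PopIoRequest();
        DoIo(req);

        // whatever was waiting on this IO runs right here, on the IO
        // thread, instead of waking the prefetcher :
        while ( Worklet * w = PopReadyDependent(req) )
        {
            w->work(w->data);
        }
    }
}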

More generally if you want to be super optimal there are complicated issues to consider :

i-cache thrashing vs. d-cache thrashing :

If we imagine the simple conceptual model that we have a data packet (or packets) and we want to do various types of work on it, you could prefer to follow one data packet through its chain of work, doing different types of work (thrashing i-cache) but working on the same data item, or you could try to do lots of the same type of work (good for i-cache) on lots of different data items.

Certainly in some cases (SPU and GPU) it is much better to keep i-cache coherent, do lots of the same type of work. But this brings up another issue :

Throughput vs. latency :

You can generally optimize for throughput (getting lots of items through with a minimum average time), or for latency (minimizing the time for any one item to get from "issued" to "done"). To minimize latency you would prefer the "data coherent" model - that is, for a given data item, do all the tasks on it. For maximum throughput you generally prefer the "task coherent" model - that is, do all the data items for each type of task, then move on to the next task. This can however create huge latency before a single item gets out.
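The two models are literally just a loop interchange. A toy sketch, where Stage and Item stand in for any task type and any data packet :

#include <vector>

struct Item { /* one data packet */ };
struct Stage { void run(Item &) { /* one type of work */ } };

// Task-coherent : each stage runs over every item before the next
// stage starts. Good i-cache, good throughput, worst-case latency.
void RunTaskCoherent(std::vector<Stage> & stages, std::vector<Item> & items)
{
    for ( Stage & s : stages )
        for ( Item & it : items )
            s.run(it);
}

// Data-coherent : follow one item through every stage. The first item
// is done as early as possible, but the i-cache thrashes per item.
void RunDataCoherent(std::vector<Stage> & stages, std::vector<Item> & items)
{
    for ( Item & it : items )
        for ( Stage & s : stages )
            s.run(it);
}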

ADDENDUM :

Let me say this in another way.

Say thread A is doing some task and when it finishes it will fire some Event (in Windows parlance). You want to do something when that Event fires.

One way to do this is to put your thread to sleep waiting on that Event. Then when the event fires, the kernel will check a list of threads waiting on that event and run them.

But sometimes what you would rather do is enqueue a function pointer onto that Event. Then you'd like the kernel to check for any functions to run when the Event is fired, and to run them immediately in the context of the firing thread.
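The facility I want would look something like this if you built it at user level (Event here is a toy type of my own, not a real OS object) :

#include <mutex>
#include <vector>

struct Event
{
    struct Callback { void (*fn)(void *); void * arg; };

    std::mutex mtx;
    std::vector<Callback> queued;

    void QueueCallback(void (*fn)(void *), void * arg)
    {
        std::lock_guard<std::mutex> lk(mtx);
        queued.push_back({fn, arg});
    }

    // called by the firing thread : run the queued work immediately,
    // in this thread's context, instead of waking the queuing threads
    void Fire()
    {
        std::vector<Callback> run;
        {
            std::lock_guard<std::mutex> lk(mtx);
            run.swap(queued);
        }
        for ( Callback & c : run )
            c.fn(c.arg);
    }
};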

I don't know of a way to do this in general on normal OS's.

Almost every OS, however, recognizes the value of this type of model, and provides it for the special case of IO, with some kind of IO completion callback mechanism. (For example, Windows has APCs, but you cannot control when an APC will be run : a user APC queued with QueueUserAPC is only delivered the next time the target thread enters an alertable wait, and IO completion routines are delivered the same way.)
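For concreteness, here's a minimal sketch of the Windows IO completion path, using ReadFileEx with a completion routine; the routine is delivered as an APC on the issuing thread once it enters an alertable wait ("file_A" is a placeholder path, error handling omitted) :

#include <windows.h>
#include <stdio.h>

static char g_buffer[4096];

// runs in the context of the thread that issued the ReadFileEx, but
// only once that thread enters an alertable wait
VOID CALLBACK OnReadDone(DWORD err, DWORD bytes, LPOVERLAPPED ov)
{
    printf("read completed : %lu bytes (error %lu)\n", bytes, err);
}

int main()
{
    HANDLE h = CreateFileA("file_A", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if ( h == INVALID_HANDLE_VALUE ) return 1;

    OVERLAPPED ov = {0};
    ReadFileEx(h, g_buffer, sizeof(g_buffer), &ov, OnReadDone);

    SleepEx(INFINITE, TRUE);    // the completion routine fires here

    CloseHandle(h);
    return 0;
}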

However, I've always found that writing IO code using IO completion callbacks is a huge pain in the ass, and it's very unpopular for that reason.

2 comments:

nereus said...

I am happy that I've found a new interesting (to me) blog.

The ideas in your post aren't new, but they are very very good.

This little bit of design thoughtwork influenced me deeply almost ten years ago:

http://pl.atyp.us/wordpress/?page_id=1277

Wes said...

On the throughput vs. latency thing IRL, traffic lights vs. stop signs are my favorite example.
