3/11/2011

03-11-11 - Worklets , IO , and Coroutines

So I'm working on this issue of combining async CPU work with IO events. I have a little async job queue thing, that I call "WorkMgr" and it runs "Worklets". See previous main post on this topic :

cbloom rants 04-06-09 - The Work Dispatcher

And also various semi-related other posts :
cbloom rants 09-21-10 - Waiting on Thread Events
cbloom rants 09-21-10 - Waiting on Thread Events Part 2
cbloom rants 09-12-10 - The deficiency of Windows' multi-processor scheduler
cbloom rants 04-15-09 - Oodle Page Cache

So I'm happy with how my WorkMgr works for pure CPU work items. It has one worker thread per core, the Worklets can be dependent on other Worklets, and it has a dispatcher to farm out Worklets using lock-free queues and all that.

(ASIDE : there is one major problem that ryg describes well , which is that it is possible for worker threads that are doing work to get swapped out for a very long time while workers on another core that could have CPU time can't find anything to do. This is basically a fundamental issue with not being in full control of the OS, and is related to the "deficiency of Windows' multi-processor scheduler" noted above. BTW this problem is much worse if you lock your threads to cores; because of that I advise that in Windows you should *never* lock your threads to cores, you can use affinity to set the preferred core, but don't use the exclusive mask. Anyway, this is an interesting topic that I may come back to in the future, but it's off topic so let's ignore it for now).

So the funny issues start arising when your work items have dependencies on external non-CPU work. For concreteness I'm going to call this "IO" (File, Network, whatever), but it's just anything that takes an unknown amount of time and doesn't use the CPU.

Let's consider a simple concrete example. You wish to do some CPU work (let's call it A), then fire an IO and wait on it, then do some more CPU work B. In pseduocode form :

WorkletLinear
{
    A();
    h = IO();
    Wait(h);
    B();
}
Now obviously you can just give this to the dispatcher and it would work, but while your worklet is waiting on the IO it would be blocking that whole worker thread.

Currently in my system the way you fix this is to split the task. You make two Worklets, the first does work A and fires the IO, the second does work B and is dependent on the first and the IO. Concretely :


Worklet2
{
    B();    
}

Worklet1
{
    A();
    h = IO();
    QueueWorklet( Worklet2, Dependencies{ h } );
}

so Worklet1 finishes and the worker thread can then do other work if there is anything available. If not, the worker thread goes to sleep waiting for one of the dependencies to be done.

This way works fine, it's what I've been using for the past year or so, but as I was writing some example code it occurred to me that it's just a real pain in the ass to write code this way. It's not too bad here, but if you have a bunch of IO's, like do cpu work, IO, do cpu work, more IO, etc. you have to make a whole chain of functions and get the dependencies right and so on. It's just like writing code for IO completion callbacks, which is a real nightmare way to write IO code.

The thing that struck me is that basically what I've done here is create one of the "ghetto coroutine" systems. A coroutine is a function call that can yield, or a manually-scheduled thread if you like. This split up Worklet method could be written as a state machine :


WorkletStatemachine
{
  if ( state == 0 )
  {
    A();
    h = IO();
    state++; enqueue self{ depends on h };
  }
  else if ( state == 1 )
  {
    B();
  }
}

In this form it's obviously the state machine form of a coroutine. What we really want is to yield after the IO and then be able to resume back at that point when some condition is met. Any time you see a state machine, you should prefer a *true* coroutine. For example, game AI written as a state machine is absolutely a nightmare to work with. Game AI written as simple linear coroutines are very nice :

    WalkTo( box )
    obj = Open( box )
    PickUp( obj )

with implicit coroutine Yields taking place in each command that takes some time. In this way you can write linear code, and when some of your actions take undetermined long amounts of time, the code just yields until that's done. (in real game AI you also have to handle interruptions and such things).

So, there's a cute way to implement coroutines in C using switch :

Protothreads - Lightweight, Stackless Threads in C
Coroutines in C

So one option would be to use something like that. You would put the hidden "state" counter into the Worklet work item struct, and use some macros and then you could write :


WorkletCoroutine
{
  crStart   // macro that does a switch on state

    A();
    h = IO();

  crWait(h,1)  // macro that does re-enqueue self with dependency, state = 1; case 1:

    B();

  crEnd
}

that gives us linear-looking code that actually gets swapped out and back in. Unfortunately, it's not practical because this C-coroutine hack doesn't preserve local variables, is creating weird scopes all over, and just is not actually usable for anything but super simple code. (the switch method gives you stackless coroutines; obvious Worklet can be a class and you could use member variables). Implementing a true (stackful) coroutine system doesn't really seem practical for cross-platform (it would be reasonably easy to do for any one platform, you just have to record the stack in crStart and copy it out in crWait, but it's just too much of a low-level hacky mess that would require intimate knowledge of the quirks of each platform and compiler). (you can do coroutines in Windows with fibers, not sure if that would be a viable solution on Windows because I've always heard "fibers are bad mmkay").

Aside : some links on coroutines for C++ :

Thinking Asynchronously in C++ Composed operations, coroutines and code makeover
Dr Dobbs Cross-Platform Coroutines in C++
COROUTINE (Keld Helsgaun)
Chapter�1.�Boost.Coroutine proposal

The next obvious option is a thread pool. We go ahead and let the work item do IO and put the worker thread to sleep, but when it does that we also fire up a new worker thread so that something can run. Of course to avoid creating new threads all the time you have a pool of possible worker threads that are just sitting asleep until you need them. So you do something like :


WorkletThreadPool
{
  A();
  h = IO();
  TheadPoolWait(h);
  B();
}

TheadPoolWait(h)
{
  number of non-waiting workers --;

  CheckThreadPool();

  Wait(h);

  number of non-waiting workers ++;
  CheckThreadPool();
}

CheckThreadPool();
{
  if ( number of non-waiting workers < desired number of workers &&
    is there any work to do )
  {
    start a new worker from the pool
  }

  if ( number of non-waiting workers > desired number of workers )
  {
    sleep worker to the pool
  }
}

// CheckThreadPool also has to be called any time a work item is added to the queue

or something like that. Desired number of workers would be number of cores typically. You have to be very careful of the details of this to avoid races, though races here aren't the worst thing in the world because they just mean you have not quite the ideal number of worker threads running.

This is a reasonably elegant solution, and on Windows is probably a good one. On the consoles I'm concerned about the memory use overhead and other costs associated with having a bunch of threads in a pool.

Of course if you were Windows only, you should just use the built-in thread pool system. It's been in Windows forever in the form of IO Completion Port handling. New in Vista is much simpler, more elegant thread pool that basically just does exactly what you want a thread pool to do, and is managed by the kernel so it's fast and robust and all that. For example, with the custom system you have to be careful to use ThreadPoolWait() instead of normal OS Wait() and if you can't get nice action when you do something that puts you to sleep in other ways (like locking a mutex or whatever).

Some links on Windows thread pools and the old IO completion stuff :

MSDN Pooled Threads Improve Scalability With New Thread Pool APIs (Vista)
MSDN Thread Pools (Windows) (Vista)
MSDN Thread Pooling (Windows) (old)
MSDN Thread Pool API (Windows) (Vista)
So you need a worker thread pool... - Larry Osterman's WebLog - Site Home - MSDN Blogs
Managed ThreadPool vs Win32 ThreadPool (pre-Vista) - Junfeng Zhang's Windows Programming Notes - Site Home - MSDN Blogs
Dr Dobbs Multithreaded Asynchronous IO & IO Completion Ports
Concurrent, Multi-Core Programming on Windows and .NET (Part II -- Threading Stephen Toub)
MSDN Asynchronous Procedure Calls (Windows)
Why does Win32 even have Fibers - Larry Osterman's WebLog - Site Home - MSDN Blogs
When does it make sense to use Win32 Fibers - Eric Eilebrecht's blog - Site Home - MSDN Blogs
Using fibers to simplify enumerators, part 3 Having it both ways - The Old New Thing - Site Home - MSDN Blogs

So I've rambled a while and don't really have a point. The end.

6 comments:

castano said...

I had a brief experience using windows fibers to work around limitations of the CUDA asynchronous API.

Each CUDA stream defines a sequence of asynchronous operations that need to be completed in order, so that in theory multiple streams could run in parallel. Ideally you would like to express the operations of each stream sequentially:

// stream 0:
cudaMemcpyAsync(..., 0)
cudaConfigureCall(..., 0), cudaLaunch(...)
cudaMemcpyAsync(..., 0)

// stream 1:
cudaMemcpyAsync(..., 1)
cudaConfigureCall(..., 1), cudaLaunch(...)
cudaMemcpyAsync(..., 1)

but in practice, the streams are only executed in parallel if the code cuda calls are interleaved, something like:

cudaMemcpyAsync(..., 0)
cudaConfigureCall(..., 0), cudaLaunch(...)
cudaMemcpyAsync(..., 1)
cudaConfigureCall(..., 1), cudaLaunch(...)
cudaMemcpyAsync(..., 0)
cudaMemcpyAsync(..., 1)

So, you have to manually schedule your code, which is completely fragile, since the duration of the kernels and the copies is not necessarily known.

Using fibers I could define each stream sequentially in the run method of the fiber. Each asynchronous calls would yield to a custom scheduler that decided what fiber to run next.

It was very trivial to get it to work, the only problem was that it was confusing to debug, you would be stepping through the code in one stream and as you step over an asynchronous call it would jump to completely different fiber, but other than that it seemed fairly robust.

This was just a toy example, so maybe in practice there are other problems.

jfb said...

I wonder if you could use macros and C-like 'variables must be declared at the beginning' to keep a counter of 'local variable stack space needed', on a declaration end macro malloc and fix-up all the pointers to be relative to the malloc location... that'd effectively be like member variables but mostly automated..

cbloom said...

@jfb -

there's no need to fix up pointers, you can just memcpy in and out of your stack, since it will only work for basic C types anyway.

I'll do a followup post with code.

I think the member variable method is pretty much 100% preferable though.

jfb said...

Yeah, copying works too. That's how Mono Continuations does it (disappointingly, it isn't very reliable with their garbage collector or soft debugger -- a great feature to 'embrace and extend' MS.NET, but they really need to make it _work reliably_...).

I was thinking you could avoid copying by having the macro resolve to (essentially)

int& a = *(int*)(state->data + 0);
int& b = etc...

and each macro adding to some 'offset' int local the sizeof() the current variable being declared to generate these lines.

cbloom said...

Yeah, but the member variable method is just 100% superior. It lets you use classes and doesn't abuse the language and is much less prone to be buggy.

Jesse James Lactin said...

Coroutines/Fibers are what you want in this situation. Stack switching is the only 100% robust-by-default means of getting this done, um, elegantly (hate how programmers use that word). By elegantly, I mean it works with pointers pointing to variables on the stack without any fiddling. Stack copying just isn't robust, with or without pointer fixups. Macros for doing offset calculations is a code smell IMO.

old rants