3/26/2013

03-26-13 - Simulating Deep Yield with a Wait

I'm becoming increasingly annoyed at my lack of "deep yield" for coroutines.

Any time you are in a work item, if you decide that you can get some more parallelism by doing a branch-merge inside that item, you need deep yield.

Remember you should never ever do an OS wait on a coroutine thread (with normal threads anyway; on a WinRT threadpool thread you can). The reason is the OS wait disables that worker thread, so you have one less. In the worst case, it leads to deadlock, because all your worker threads can be asleep waiting on work items, and with no worker threads they will never get done.

Anyway, I've cooked up a temporary work-around, it looks like this :


I'm in some function and I want to branch-merge

If I'm not on on a worker thread
  -> just do a normal branch-merge, send the work off and use a Wait for completion

If I am on a worker thread :

inc target worker thread count
if # currently live worker threads is < target count
  start a new worker thread (either create or wake from pool)

now do the branch-merge and use OS Wait
dec the target worker thread count


on each worker thread, after completing a work item and before popping more work :
if target worker thread count < currently live count
  stop self (go back into a sleeping state in the pool)

this is basically using OS threads to implement stack-saving deep yield. It's not awesome, but it is okay if deep yield is rare.

8 comments:

Stephan said...

This sounds a bit like the algorithm used by the Go scheduler that Fabian described in a comment to an earlier post: http://cbloomrants.blogspot.de/2013/02/02-23-13-threading-reasoning-behind.html

Could you maybe comment a bit more on how you schedule coroutines on Windows as opposed to other systems without an OS ThreadPool? Have you tried User-Mode Scheduling (http://msdn.microsoft.com/en-us/library/vstudio/dd627187.aspx) or the Concurrency Runtime for C++?

cbloom said...

Yeah, definitely quite similar. It's really the only option unless you are inside the OS.

"how you schedule coroutines on Windows as opposed to other systems without an OS ThreadPool"

I use the same system on every platform, because I'm paranoid about creating differences.

On Windows you could do a better system very easily by just using the OS Fibers on an OS threadpool.

(Win32's threading primitives are amazing; now that I've actually worked with all of them, I think the win32 kernel is a masterpiece compared to Linux/Mach/BSD/etc; unfortunately all the crap piled on top of the kernel in win32 is not so great).

About UMS :

From what I gather, UMS is sort of an upgrade to Fibers. The primary difference is that UMS contexts get their own OS thread context, while Fibers don't. This fixes some bugs in code that uses Fibers, mainly code that calls OS system calls that do thread-local things. (probably the most common is the GetLastError problem).

That is, a UMS thread is like a fiber (so it's lightweight and switches in user-mode) that gets its own copy of the OS thread-info-block.

Stephan said...

Thank you for your reply!

> It's really the only option unless you are inside the OS.

One alternative is to move the coroutine that is about to enter a blocking syscall to another thread (e.g. in a dedicated IO thread pool) and continue to run other queued coroutines on the current primary thread. When the syscall finishes you could push the coroutine back into a queue of a primary thread. This may or may not provide for better cache locality and make life easier for the OS scheduler. It would allow you to treat all kind of waits (blocking syscalls, waits for async network io, waits on coroutine synchronization primitives, etc) similarly.

> On Windows you could do a better system very easily by just using the OS Fibers on an OS threadpool.

Do you mean that you'd use the thread pool queue for queuing fibers and more or less run one fiber per work item? Wouldn't that cost you some performance due to the strict FIFO order of the thread pool queue (or the overhead associated with working around the FIFO order) and the additional user space/kernel transitions?

Unknown said...

After reading a number of your posts on the subject, lately I've been playing around with something a little similar - but instead of firing up new worker threads I fire up new worker fibers (os fibers, or ucontext depending on platform).

The worker fibers just try and pull tasks from a task queue (as a worker thread would normally do) and if the task it is currently running wants it can yield control back to the worker thread. At that point the thread will potentially add the worker to a list of active workers (depending on the reason for the yield) and then wake up an inactive worker fiber (from a pool of inactive fibers) to process remaining tasks. If there are no pending tasks then the thread will process active worker fibers that need updating (e.g. a main game loop) or were rescheduled because whatever condition they were waiting on was signaled (e.g. file read completed). The scheduling is still a little shaky - like I said I am still just playing around with it.

cbloom said...

"One alternative is to move the coroutine that is about to enter a blocking syscall to another thread"

How do you do that when you're 10 levels deep in function calls and don't have manual stack-saving? The whole point of "deep yield" is I want to be able to yield the coroutine from *anywhere*. (I know it's not standard, it's terminology I invented)

If you just want to handle blocking syscalls at the root level of the coroutine, yeah that's a much easier problem (which I currently handle fine by making the syscall into a lambda and giving to my syscall threadpool, basically as you describe).

Stephan said...

> How do you do that when you're 10 levels deep in function calls and don't have manual stack-saving?

I was thinking of a general coroutine system, where you can save and switch the execution context (instruction pointer, stack pointer, etc.), like Windows Fibers. That would allow you to switch out of the coroutine immediately before the blocking call and then switch back into it on a thread pool thread.

cbloom said...

Yeah, that's not exactly what I'm talking about here.

What you're talking about is more like the Go problem, preventing your workers from being stalled in OS calls. In a stack-saving system you don't have the problems I describe in the original post (deadlock due to workers waiting on wait) because you would never use an OS Wait on work, you would just use a deep yield of your stack-saving coroutine.

I'm talking about what you can do to simulate a stack-saving deep yield when you only have simple coroutines like the C-switch hack or even WinRT style where you have to return a continuation out of the root level of your coroutine in order to yield. The original post is about how you can do a branch-merge safely in a system like that.

Stephan said...

You are right, sorry for the confusion. In a system with proper coroutines you don't need an OS wait to wait on other coroutines. The problem of handling blocking syscalls in such a system is quite similar to the problem you described in your post though.

old rants