2/23/2013

02-23-13 - Threading - Reasoning Behind Coroutine Centric Design

cbloom rants 12-21-12 - Coroutine-centric Architecture is a proposed architecture.

Why do I think it should be that way? Let's revisit some points.

1. Main thread should be a worker and all workers should be symmetric. That is, there's only one type of thread - worker threads, and all functions are work items. There are no special-purpose threads.

The purpose of this is to minimize thread switches, and to make waits be immediate runs whenever possible.

Consider the alternative. Say you have a classic "main" thread and a worker thread. Your main thread is running along and then decides it has to Wait() on a work item. It has to sleep the thread pending a notification from the worker thread. The OS has to switch to the worker, run the job, notify, then switch you back.

With fully symmetric threads, there is no actual thread wait there. If the work item is not started, or is in a yield point of a coroutine, you simply pop it and run it immediately. (of course your main() also has to be a coroutine, so that it can be yielded out at that spot to run the work item). Symmetric threads = less thread switching.

There are other advantages. One is that you're less affected by the OS starving one of your threads. When your threads are not symmetric, if one is starved (and is the bottleneck) it can ruin your throughput; one crucial job or IO can't run and then all the other threads back up. With symmetric threads, someone else grabs that job and off you go.

Symmetric threads are self-balancing. Any time you decide "we have 2 threads for graphics and 1 for compute" you are assuming you know your load exactly, and you can only be wrong. Symmetric threads maximize the utilization of the cpu. (Note that for cache coherence you might want to have a system that *prefers* to keep the same time of job on the same thread, but that's only a soft preference and it will run other jobs if there are none of the right type).

Symmetric threads scale cleanly down to 1. This is a big one that I think is important. Even just for debugging purposes, you want to be able to run your system non-threaded. With asymmetric threads you have to have a custom "non-threaded" pathway, which leads to bugs and means you aren't testing the same threaded pathway. The symmetric thread system scales down to 1 thread using the same code as always - when you wait on a job, if it hasn't been started it's just run immediately.

It's also much easier to have deadlocks in asymmetric thread systems. If an IO job waits on a graphics job, and a graphics job waits on an IO job, you're in a tricky situation; of course you shouldn't deadlock as long as there are no circular dependencies, but if one of those threads is processing in FIFO order you can get in trouble. It's just better to have a system where that issue doesn't even arise.

2. Deep yield.

Obviously if you want to write real software, you can't be returning out to the root level of the coroutine every time you want to yield.

In the full coroutine-centric architecture, all the OS waits (mutex locks, etc) should be coroutine yields. The only way to do that is if they can call yield() internally and it's a full stack-staving deep yield.

Of course you should be able to spawn more coroutines from inside your coroutine, and wait on them (with that wait being a yield). That is, aside from the outer branch-merge, each internal operation should be able to do its own branch-merge, and yield its thread to its sub-items.

3. Everything GC.

This is just the only reasonable way to write code in this system. It gives you a race-free way to ensure that object lifetimes exceed their usage.

The last post I did about the simple string crash is just so easy to do. The problem is that without GC you inevitably try to be "clever" and "efficient" (really "dumb" and "pointless") about your lifetime management. That is, you'll write things like :


void func1()
{
char name[256];
.. file name ..

Handle h = StartJob( LoadAndDecompress, name );

...

Wait(h);
}

which is okay, because it waits on the async op inside the lifetime of "name". But of course a week later you change this function to :

Handle func1()
{
char name[256];
.. file name ..

Handle h = StartJob( LoadAndDecompress, name );

...

return h;
}

with the wait done externally, and now it's a crash. Manual lifetime management in heavily-threaded code is just not reasonable.

The other compelling reason is that you want to be able to have "dangling" coroutines, that is you don't want to have to wait on them and clean them up on the outside, just fire them off and the clean themselves when they finish. This requires some kind of ref-counted or GC'ed ownership of all objects.

4. A thread per core.

With all your "threading" as coroutines and all your waits as "yields", you no longer need threads to share the cpu time, so you just make one thread per core and leave it there.

I wanted to note an exception to this - OS signals that cannot be converted to yields, such as IO. In this case you still need to do a true OS Wait that would block a thread. This would stop your entire worker from running, so that's not nice.

The solution is to have a separate pool of threads for running the small set of OS functions that do internal thread waits. That is, you convert :


ReadFile( args )

->

yield RunOnThreadPool( ReadFile, args );

this separate pool of threads is in addition to the one per core (or it could just be all one pool, and you make new threads as needed to ensure that #cores of them are running).

7 comments:

Fabian Giesen said...

Again, you might want to check out the Go (www.golang.org) runtime.

They always try to maintain a given number of active worker threads that run "goroutines".

If one of them starts a syscall - any syscall - it might end up waiting or blocking internally. In their syscall wrapper, they just mark the active thread as inside a syscall, which means it ceases being a normal worker, so your active worker count is down. The runtime may then later spawn another worker thread to bring the worker count up to what it should be (whether it actually does depends on how long the syscall takes).

After the syscall is done, you might now have an extra thread. In that case, the syscall thread goes to sleep; it might get swapped back in later to pick up work as another thread enters a syscall.

It's a really nice way for a worker-thread-centric environment to interface with OS primitives and libraries that aren't, with minimal friction.

won3d said...

In other links Charles won't look at:

http://swtch.com/libtask/

...which was done by own of the main Go developers prior to his working on Go. Single-threaded, but the interface is basically the right one.

Go, naturally, does a better job of it.

cbloom said...

WTF.

jfb said...

Charles, have you tried prototyping it with Mono Continuations (http://www.mono-project.com/Continuations)?

Mono's garbage collected, embeddable, and with this primitive what you speak of can be built.

cbloom said...

jfb, the easiest prototype would just be on Windows Fibers, which basically does everything I need (including turning OS waits into yields for me). Combine that with the Oodle "future" system of automatic dependencies and it's the system I propose.

The problem is I can't actually use it, because in my current situation I don't get to do architecture. The client sets the architecture and I have to work in it.

One difficulty with coroutine-centric-design is that it doesn't work great half way.

Like if the client has a few normal threads, and I have coroutine workers, and they are calling me from normal threads, you have put something in the TLS to say "is this a client thread or one of mine" and then everything you do has to check that bool and behave differently. It's a mess and very easy to make deadlocks and other problems.

I'm convinced that the full system is the way to go, and I would try doing a game that way. (I would also use the lock-free weak-reference object table with a builtin RWlock).

Stephan said...

Do you have an opinion on whether its better to move the blocking system call to a new thread (e.g. in an IO thread pool) or instead activate a new worker thread for executing coroutines and do the system call on the current thread?

Can you think of realistic situations where a fixed cap on the number of threads (say, 2 * #processor) in blocking system calls (excluding network IO) could lead to a deadlock?

cbloom said...

I don't see a problem off hand.

You have to ensure that nothing in your system ever does an OS Wait() on other jobs, only Yield. That ensures forward progress. Also obviously you have to ensure no circles in the dependency graph. Also if any jobs poll for completion rather than doing a proper yield, that can lead to livelocks and priority inversions.

old rants