12/21/2012

12-21-12 - Coroutine-centric Architecture

I've been talking about this for a while but maybe haven't written it all clearly in one place. So here goes. My proposal for a coroutine-centric architecture (for games).

1. Run one thread locked to each core.

(NOTE : this is only appropriate on something like a game console where you are in control of all the threads! Do not do this on an OS like Windows where other apps may also be locking to cores, and you have the thread affinity scheduler problems, and so on).

The one-thread-per-core set of threads is your thread pool. All code runs as "tasks" (or jobs or whatever) on the thread pool.

The threads never actually do ANY OS Waits. They never switch. They're not really threads, you're not using any of the OS threading any more. (I suppose you still are using the OS to handle signals and such, and there are probably some OS threads that are running which will grab some of your time, and you want that; but you are not using the OS threading in your code).
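Just to make the mechanism concrete, pinning a worker to a core looks something like this (a Linux-flavored sketch using the pthread affinity extension; a console SDK would have its own equivalent call) :

#define _GNU_SOURCE
#include <pthread.h>

// pin the given worker thread to one core; call once per worker at startup :
static void pin_thread_to_core( pthread_t thread, int core )
{
    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( core, &set );
    pthread_setaffinity_np( thread, sizeof(cpu_set_t), &set );
}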

2. All functions are coroutines. A function with no yields in it is just a very simple coroutine. There's no special syntax to be a coroutine or call a coroutine.

All functions can take futures or return futures. (a future is just a value that's not yet ready). Whether you want this to be totally implicit or not is up to your taste about how much of the operations behind the scenes are visible in the code.

For example if you have a function like :


int func(int x);

and you call it with a future<int> :

future<int> y;
func(y);

it is promoted automatically to :

future<int> func( future<int> x )
{
    yield x;
    return func( x.value );
}

When you call a function, it is not a "branch", it's just a normal function call. If that function yields, it yields the whole current coroutine. That is, it's just like threading and waits, but with coroutines and yields instead.

To branch I would use a new keyword, like "start" :


future<int> some_async_func(int x);

int current_func(int y)
{

    // execution will step directly into this function;
    // when it yields, current_func will yield

    future<int> f1 = some_async_func(y);

    // with "start" a new coroutine is made and enqueued to the thread pool
    // my coroutine immediately continues to the f1.wait
    
    future<int> f2 = start some_async_func(y);
    
    return f1.wait();
}

"start" should really be an abbreviation for a two-phase launch, which allows a lot more flexibility. That is, "start" should be a shorthand for something like :


start some_async_func(y);

is

coro * c = new coro( some_async_func(y) );
c->start();

because that allows batch-starting, and things like setting dependencies after creating the coro, which I have found to be very useful in practice. eg :

coro * c[32];

for(i in 32)
{
    c[i] = new coro( );
    if ( i > 0 )
        c[i-1]->depends( c[i] );
}

start_all( c, 32 );

Batch starting is one of those things that people often leave out. Starting tasks one by one is just like waiting for them one by one (instead of using a wait_all) : it causes bad thread-thrashing (waking up and going back to sleep over and over, or switching back and forth).

3. Full stack-saving is crucial.

For this to be efficient you need a very small minimum stack size (4k is probably good) and you need stack-extension on demand.

You may have lots of pending coroutines sitting around and you don't want them gobbling all your memory with 64k stacks.

Full stack saving means you can do full variable capture for free, even in a language like C where tracking references is hard.
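One way to get the small-start, grow-on-demand stacks is the classic reserve-and-commit trick; a POSIX-flavored sketch (error checking and the fault-handler plumbing that does the actual extension are omitted) :

#include <stddef.h>
#include <sys/mman.h>

// reserve lots of address space but commit only the top 'commit' bytes
// (stacks grow down); returns the initial stack top :
static void * reserve_coro_stack( size_t reserve, size_t commit )
{
    char * base = (char *) mmap( 0, reserve, PROT_NONE,
                                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 );
    mprotect( base + reserve - commit, commit, PROT_READ|PROT_WRITE );
    // a SIGSEGV handler that sees a fault just below the committed region
    // would mprotect more pages there to extend the stack on demand
    return base + reserve;
}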

4. You stop using the OS mutex, semaphore, event, etc. and instead use coroutine variants.

Instead of a thread owning a lock, a coroutine owns a lock. When you block on a lock it's a yield of the coroutine instead of a full OS wait.

Getting access to a mutex or semaphore is an event that can trigger coroutines being run or resumed. eg. it's a future just like the return from an async procedural call. So you can do things like :


future<int> y = some_async_func();

yield( y , my_mutex.when_lock() );

which yields your coroutine until the joint condition is met that the async func is done AND you can get the lock on "my_mutex".

Joint yields are very important because they prevent unnecessary coroutine wakeup. Coroutine thrashing is not nearly as bad as thread thrashing (that cheapness is one of the big advantages of a coroutine-centric architecture, in fact perhaps the biggest), but it's still worth avoiding.

You must have coroutine versions of all the ops that have delays (file IO, networking, GPU, etc) so that you can yield on them instead of doing thread-waits.

5. You must have some kind of GC.

Because coroutines will constantly be capturing values, you must ensure their lifetime is >= the life of the coroutine. GC is the only reasonable way to do this.

I would also go ahead and put an RW-lock in every object, since that will be necessary.

6. Dependencies and side effects should be expressed through args and return values.

You really need to get away from funcs like


void DoSomeStuff(void);

that have various un-knowable inputs and outputs. All inputs & outputs need to be values so that they can be used to create dependency chains.

When that's not directly possible, you must use a convention to express it. eg. for file manipulation I recommend using a string containing the file name to express the side effects that go through the file system (eg. for Rename, Delete, Copy, etc.).
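eg. a Rename that participates in dependency chains might look like this; a sketch with std::future standing in for the hypothetical future<> above (the signatures are mine, not a real API) :

#include <future>
#include <string>
#include <cstdio>

// the side effect goes through the file system, so pass the file name
// through as the value that carries the dependency :
std::shared_future<std::string> Rename_Async(
        std::shared_future<std::string> from, std::string to )
{
    return std::async( std::launch::async, [=]() -> std::string {
        std::rename( from.get().c_str(), to.c_str() );  // waits for producer
        return to;  // anyone consuming "to" now depends on the rename
    } ).share();
}

// usage : a Delete_Async( Rename_Async( f, "b" ) ) cannot delete before the rename.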

7. Note that coroutines do not fundamentally alter the difficulties of threading.

You still have races, deadlocks, etc. Basic async ops are much easier to write with coroutines, but they are no panacea and do not try to be anything other than a nicer way of writing threading. (eg. they are not transactional memory or any other auto-magic).

to be continued (perhaps) ....

Add 3/15/13 : 8. No static size anything. No resources you can run out of. This is another "best practice" that goes with modern thread design that I forgot to list.

Don't use fixed-size queues for thread communication; they seem like an optimization or simplification at first, but if you can ever hit the limit (and you will) they cause big problems. Don't assume a fixed number of workers or a maximum number of async ops in flight; this can cause deadlocks and be a big problem.

The thing is that a "coroutine centric" program is no longer so much like a normal imperative C program. It's moving towards a functional program where the path of execution is all nonlinear. You're setting up a big graph to evaluate, and then you just need to be able to hit "go" and wait for the graph to close. If you run into some limit at some point during the graph evaluation, it's a big mess figuring out how to deal with that.

Of course the OS can impose limits on you (eg. running out of memory) and that is a hassle you have to deal with.

12-21-12 - Coroutines From Lambdas

Being pedantic while I'm on the topic. We've covered this before.

Any language with lambdas (that can be fired when an async completes) can simulate coroutines.

Assume we have some async function call :


future<int> AsyncFunc( int x );

which sends the integer off over the net (or whatever) and eventually gets a result back. Assume that future<> has an "AndThen" which schedules a function to run when it's done.

Then you can write a sequence of operations like :


future<int> MySequenceOfOps( int x1 )
{
    x1++;

    future<int> f1 = AsyncFunc(x1);

    return f1.AndThen( [](int x2){

    x2 *= 2;

    future<int> f2 = AsyncFunc(x2);

    return f2.AndThen( [](int x3){

    x3 --;

    return x3;

    } );
    } );

}

with a little munging we can make it look more like a standard coroutine :

#define YIELD(future,args)  return future.AndThen( [](args){

future<int> MySequenceOfOps( int x1 )
{
    x1++;

    future<int> f1 = AsyncFunc(x1);

    YIELD(f1,int x2)

    x2 *= 2;

    future<int> f2 = AsyncFunc(x2);

    YIELD(f2,int x3)

    x3 --;

    return x3;

    } );
    } );

}

the only really ugly bit is that you have to put a bunch of scope-closers at the end to match the number of yields.

This is really what any coroutine is doing under the hood. When you hit a "yield", what it does is take the remainder of the function and package that up as a functor to get called after the async op that you're yielding on is done.

Coroutines from lambdas have a few disadvantages, aside from the scope-closers annoyance. It's ugly to do anything but simple linear control flow. The above example is the very simple case of "imperative, yield, imperative, yield" , but in real code you want to have things like :


if ( bool )
{
    YIELD
}

or

while ( some condition )
{

    YIELD
}

which while probably possible with lambda-coroutines, gets ugly.
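eg. the while-loop case has to turn into a function that re-schedules itself. In the same hypothetical future<>/AndThen notation as above (and assuming AndThen flattens a returned future, which continuation libraries generally do), it's something like :

future<int> loop_step( int x )
{
    if ( ! some_condition(x) )
        return make_ready_future(x);    // loop exit (assumed helper)

    future<int> f = AsyncFunc(x);

    return f.AndThen( [](int x2){

        return loop_step(x2);   // the "continue" is another chained step

    } );
}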

An advantage of lambda-coroutines is if you're in a language where you have lambdas with variable-capture, then you get that in your coroutines.

12/18/2012

12-18-12 - Async-Await ; Microsoft's Coroutines

As usual I'm way behind in knowing what's going on in the world. Lo and behold, MS have done a coroutine system very similar to mine, which they are now pushing as a fundamental technology of WinRT. Dear lord, help us all. (I guess this stuff has been in .NET since 2008 or so, but with WinRT it's now being pushed on C++/CX as well)

I'm just catching up on this, so I'm going to make some notes about things that took a minute to figure out. Correct me where I'm wrong.

For the most part I'll be talking in C# lingo, because this stuff comes from C# and is much more natural there. There are C++/CX versions of all this, but they're rather more ugly. Occasionally I'll dip into what it looks like in CX, which is where we start :

1. "hat" (eg. String^)

Hat is a pointer to a ref-counted object. The ^ means inc and dec ref in scope. In cbloom code String^ is StringPtr.

The main note : "hat" is a thread-safe ref count, *however* it implies no other thread safety. That is, the ref-counting and object destruction is thread safe / atomic , but derefs are not :


Thingy^ t = Get(); // thread safe ref increment here
t->var1 = t->var2; // non-thread safe var accesses!

There is no built-in mutex or anything like that for hat-objects.

2. "async" func keyword

Async is a new keyword that indicates a function might be a coroutine. It does not make the function into an asynchronous call. What it really is, is a "structify" or "functor" keyword (plus a "switch"). Like a C++ lambda, the main thing the language does for you is package up all the local variables and function arguments and put them all in a struct. That is (playing rather loosely with the translation for brevity) :


async void MyFunc( int x )
{
    string y;

    stuff();
}

[ is transformed to : ]

struct MyFunc_functor
{
    int x;
    string y;

    void Do() { stuff(); }
};

void MyFunc( int x )
{
    // allocate functor object :
    MyFunc_functor * f = new MyFunc_functor();
    // copy in args :
    f->x = x;
    // run it :
    f->Do();
}

So obviously this functor that captures the function's state is the key to making this into an async coroutine.

It is *not* stack saving. However for simple usages it is the same. Obviously crucial to this is using a language like C# which has GC so all the references can be traced, and everything is on the heap (perhaps lazily). That is, in C++ you could have pointers and references that refer to things on the stack, so just packaging up the args like this doesn't work.

Note that in the above you didn't see any task creation or asynchronous func launching, because it's not. The "async" keyword does not make a function async, all it does is "functorify" it so that it *could* become async. (this is in contrast to C++11 where "async" is an imperative to "run this asynchronously").

3. No more threads.

WinRT is pushing very hard to remove manual control of threads from the developer. Instead you have an OS thread pool that can run your tasks.

Now, I actually am a fan of this model in a limited way. It's the model I've been advocating for games for a while. To be clear, what I think is good for games is : run 1 thread per core. All game code consists of tasks for the thread pool. There are no special purpose threads, any thread can run any type of task. All the threads are equal priority (there's only 1 per core so this is irrelevant as long as you don't add extra threads).

So, when a coroutine becomes async, it just enqueues to a thread pool.

There is this funny stuff about execution "context", because they couldn't make it actually clean (so that any task can run on any thread in the pool); a "context" is a set of one or more threads with certain properties; the main one is the special UI context, which only gets one thread, and which can therefore deadlock. This looks like a big mess to me, but as long as you aren't actually doing C# UI stuff you can ignore it.

See ConfigureAwait etc. There seems to be a lot of control you might want that's intentionally missing. Things like how many real threads are in your thread pool; also things like "run this task on this particular thread" are forbidden (or even just "stay on the same thread"; you can only stay on the same context, which may be several threads).

4. "await" is a coroutine yield.

You can only use "await" inside an "async" func because it relies on the structification.

It's very much like the old C-coroutines using switch trick. await is given an Awaitable (an interface to an async op). At that point your struct is enqueued on the thread pool to run again when the Awaitable is ready.

"await" is a yield, so you may return to your caller immediately at the point that you await.

Note that because of this, "async/await" functions cannot have return values (* except for Task which we'll see next).

Note that "await" is the point at which an "async" function actually becomes async. That is, when you call an async function, it is *not* initially launched to the thread pool, instead it initially runs synchronously on the calling thread. (this is part of a general effort in the WinRT design to make the async functions not actually async whenever possible, minimizing thread switches and heap allocations). It only actually becomes an APC when you await something.

(aside : there is a hacky "await Task.Yield()" mechanism which kicks off your synchronous invocation of a coroutine to the thread pool without anything explicit to await)

I really don't like the name "await" because it's not a "wait" , it's a "yield". The current thread does not stop running, but the current function might be enqueued to continue later. If it is enqueued, then the current thread returns out of the function and continues in the calling context.

One major flaw I see is that you can only await one async; there's no yield_all or yield_any. Because of this you see people writing atrocious code like :

await x;
await y;
await z;
stuff(x,y,z);

Now they do provide a Task.WhenAll and Task.WhenAny, which create proxy tasks that complete when the desired condition is met, so it is possible to do it right (but much easier not to).

Of course "await" might not actually yield the coroutine; if the thing you are awaiting is already done, your coroutine may continue immediately. If you await a task that's not done (and also not already running), it might be run immediately on your thread. They intentionally don't want you to rely on any certain flow control, they leave it up to the "scheduler".

5. "Task" is a future.

The Task< > template is a future (or "promise" if you like) that provides a handle to get the result of a coroutine when it eventually completes. Because of the previously noted problem that "await" returns to the caller immediately, before your final return, you need a way to give the caller a handle to that result.

IAsyncOperation< > is the lower level C++/COM version of Task< > ; it's the same thing without the helper methods of Task.

IAsyncOperation.Status can be polled for completion. IAsyncOperation.GetResults can only be called after completed. IAsyncOperation.Completed is a callback function you can set to be run on completion. (*)

So far as I can tell there is no simple way to just Wait on an IAsyncOperation. (you can "await"). Obviously they are trying hard to prevent you from blocking threads in the pool. The method I've seen is to wrap it in a Task and then use Task.Wait()

(* = the .Completed member is a good example of a big annoyance : they play very fast-and-loose with documenting the thread safety semantics of the whole API. Now, I presume that for .Completed to make any sense it must be a thread-safe accessor, and it must be atomic with Status. Otherwise there would be a race where my completion handler would not get called. Presumably your completion handler is called once and only once. None of this is documented, and the same goes across the whole API. They just expect it all to magically work without you knowing how or why.)

(it seems that .NET used to have a Future< > as well, but that's gone since Task< > is just a future and having both is pointless (?))


So, in general if I read it as :


"async" = "coroutine"  (hacky C switch + functor encapsulation)

"await" = yield

"Task" = future

then it's pretty intuitive.


What's missing?

Well there are some places that are syntactically very ugly, but possible. (eg. working with IAsyncOperation/IAsyncInfo in general is super ugly; also the lack of simple "await x,y,z" is a mistake IMO).

There seems to be no way to easily automatically promote a synchronous function to async. That is, if you have something like :


int func1(int x) { return x+1; }

and you want to run it on a future of an int (Task< int >) , what you really want is just a simple syntax like :

future<int> x = some async func that returns an int

future<int> y = start func1( x );

which makes a coroutine that waits for its args to be ready and then runs the synchronous function. (maybe it's possible to write a helper template that does this?)
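For what it's worth, in plain C++11 std::future terms that helper could be written something like this (an analogy only, not WinRT; note it thread-waits inside the worker where a real coroutine system would yield) :

#include <future>

// run a synchronous function when its future argument becomes ready :
template <typename t_func, typename t_arg>
auto start_when_ready( t_func f, std::shared_future<t_arg> arg )
    -> std::future<decltype( f( arg.get() ) )>
{
    // the get() blocks a pool thread; that's exactly the part a
    // coroutine-aware Task system would turn into a yield
    return std::async( std::launch::async, [=]{ return f( arg.get() ); } );
}

// usage : std::future<int> y = start_when_ready( func1, x );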

Now it's tempting to do something like :


future<int> x = some async func that returns an int

int y = func1( await x );

and you see that all the time in example code, but of course that is not the same thing at all and has many drawbacks (it waits immediately even though "y" might not be needed for a while, it doesn't allow you to create async dependency chains, it requires you are already running as a coroutine, etc).

The bigger issue is that it's not a real stackful coroutine system, which means it's not "composable", something I've written about before :
cbloom rants 06-21-12 - Two Alternative Oodles
cbloom rants 10-26-12 - Oodle Rewrite Thoughts

Specifically, a coroutine cannot call another function that does the await. This makes sense if you think of the "await" as being the hacky C-switch-#define thing, not a real language construct. The "async" on the func is the "switch {" and the "await" is a "case ". You cannot write utility functions that are usable in coroutines and may await.

To call functions that might await, they must be run as their own separate coroutine. When they await, they block their own coroutine, not your calling function. That is :


int helper( bool b , AsyncStream s )
{
    if ( b )
    {
        return 0;
    }
    else
    {
        int x = await s.Get<int>();
        return x + 10;
    }
}

async Task<int> myfunc1()
{
    AsyncStream s = open it;
    int x = helper( true, s );
    return x;
}

The idea here is that "myfunc1" is a coroutine, it calls a function ("helper") which does a yield; that yields out of the parent coroutine (myfunc1). That does not work and is not allowed. It is what I would like to see in a good coroutine-centric language. Instead you have to do something like :

async Task<int> helper( bool b , AsyncStream s )
{
    if ( b )
    {
        return 0;
    }
    else
    {
        int x = await s.Get<int>();
        return x + 10;
    }
}

async Task<int> myfunc1()
{
    AsyncStream s = open it;
    int x = await helper( true, s );
    return x;
}

Here "helper" is its own coroutine, and we have to block on it. Now it is worth noting that because WinRT is aggressive about delaying heap-allocation of coroutines and is aggresive about running coroutines immediately, the actual flow of the two cases is not very different.

To be extra clear : lack of composability means you can't just have something like "cofread" which acts like synchronous fread , but instead of blocking the thread when it doesn't have enough data, it yields the coroutine.

You also can't write your own "cosemaphore" or "comutex" that yield instead of waiting the thread. (does WinRT provide cosemaphore and comutex? To have a fully functional coroutine-centric language you need all that kind of stuff. What does the normal C# Mutex do when used in a coroutine? Block the whole thread?)


There are a few places in the syntax that I find very dangerous due to their (false) apparent simplicity.

1. Args to coroutines are often references. When the coroutine is packaged into a struct and its execution is delayed, what you get is a non-thread-safe pointer to some shared object. It's incredibly easy to write code like :


async void func1( SomeStruct^ s )
{

    s->DoStuff();
    MoreStuff( s );

}

where in fact every touch of 's' is potentially a race and bug.

2. There is no syntax required to start a coroutine. This means you have no idea if functions are async or not at the call site!


void func2()
{

DeleteFile("b");
CopyFile("a","b");

}

Does this code work? No idea! They might be coroutines, in which case DeleteFile might return before it's done, and then I would be calling CopyFile before the delete. (if it is a coroutine, the fix is to call "await", assuming it returned a Task).

Obviously the problem arises from side effects. In this case the file system is the medium for communicating side effects. To use coroutine/future code cleanly, you need to try to make all functions take all their inputs as arguments, and to return all their effects as return values. Even if the return is not necessary, you must return some kind of proxy to the change as a way of expressing the dependency.

"async void" functions are probably bad practice in general; you should at least return a Task with no data (future< void >) so that the caller has something to wait on if they want to. async functions with side effects are very dangerous but also very common. The fantasy that we'll all write pure functions that only read their args (by value) and put all output in their return values is absurd.


It's pretty bold of them to make this the official way to write new code for Windows. As an experimental C# language feature, I think it's pretty decent. But good lord man. Race city, here we come. The days of software having repeatable outcomes are over!

As a software design point, the whole idea that "async improves responsiveness" is rather disturbing. We're gonna get a lot more trickle-in GUIs, which is fucking awful. Yes, async is great for making tasks that the user *expects* to be slow to run in the background. What it should not be used for is hiding the slowness of tasks that should in fact be instant. Like when you open a new window, it should immediately appear fully populated with all its buttons and graphics - if there are widgets in the window that take a long time to appear, they should be fixed or deleted, not made to appear asynchronously.

The way web pages give you an initial view and then gradually trickle in updates? That is fucking awful and should be used as a last resort. It does not belong in applications where you have control over your content. But that is exactly what is being heavily pushed by MS for all WinRT apps.

Having buttons move around after they first appeared, or having buttons appear after the window first opened - that is *terrible* software.

(Windows 8 is of course itself an example; part of their trick for speeding up startup is to put more things delayed until after startup. You now have to boot up, and then sit there and twiddle your thumbs for a few minutes while it actually finishes starting up. (there are some tricks to reduce this, such as using Task Scheduler to force things to run immediately at the user login event))


Some links :

Jerry Nixon @work Windows 8 The right way to Read & Write Files in WinRT
Task.Wait and "Inlining" - .NET Parallel Programming - Site Home - MSDN Blogs
CreateThread for Windows 8 Metro - Shawn Hargreaves Blog - Site Home - MSDN Blogs
Diving deep with WinRT and await - Windows 8 app developer blog - Site Home - MSDN Blogs
Exposing .NET tasks as WinRT asynchronous operations - Windows 8 app developer blog - Site Home - MSDN Blogs
Windows 8 File access sample in C#, VB.NET, C++, JavaScript for Visual Studio 2012
Futures and promises - Wikipedia, the free encyclopedia
Effective Go - The Go Programming Language
Deceptive simplicity of async and await
async (C# Reference)
Asynchronous Programming with Async and Await (C# and Visual Basic)
Creating Asynchronous Operations in C++ for Windows Store Apps
Asynchronous Programming - Easier Asynchronous Programming with the New Visual Studio Async CTP
Asynchronous Programming - Async Performance Understanding the Costs of Async and Await
Asynchronous Programming - Pause and Play with Await
Asynchronous programming in C++ (Windows Store apps) (Windows)
AsyncAwait Could Be Better - CodeProject
File Manipulation in Windows 8 Store Apps
SharpGIS Reading and Writing text files in Windows 8 Metro

12/06/2012

12-6-12 - Theoretical Oodle Rewrite Continued

So, continuing on the theme of a very C++-ish API with "futures" , ref-counted buffers, etc. :

cbloom rants 07-19-12 - Experimental Futures in Oodle
cbloom rants 10-26-12 - Oodle Rewrite Thoughts

It occurs to me that this could massively simplify the giant API.

What you do is treat "array data" as a special type of object that can be linearly broken up. (I noted previously about having RW locks in every object and special-casing arrays by letting them be RW-locked in portions instead of always locking the whole buffer).

Then arrays could have two special ways of running async :

1. Stream. A straightforward futures sequence to do something like read-compress-write would wait for the whole file read to be done before starting the compress. What you could do instead is have the read op immediately return a "stream future" which would be able to dole out portions of the read as it completed. Any call that processes data linearly can be a streamer, so "compress" could also return a stream future, and "write" would then be able to write out compressed bits as they are made, rather than waiting on the whole op.

2. Branch-merge. This is less of an architectural thing than just a helper (you can easily write it client-side with normal futures); it takes an array and runs the future on portions of it, rather than running on the whole thing. But having this helper in from the beginning means you don't have to write lots of special case branch-merges to do things like compress large files in several pieces.
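In ordinary std::async terms the branch-merge helper might look like this (just a sketch of the shape, not the real Oodle API; 'op' is a void(buf,size) chunk worker) :

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// run 'op' on chunks of the array in parallel, then wait for all of them :
template <typename t_op>
void branch_merge( unsigned char * buf, size_t size, size_t chunk, t_op op )
{
    std::vector< std::future<void> > parts;
    for ( size_t pos = 0; pos < size; pos += chunk )
        parts.push_back( std::async( std::launch::async,
                            op, buf + pos, std::min( chunk, size - pos ) ) );
    for ( auto & f : parts )    // the "merge" :
        f.wait();
}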

So you basically just have a bunch of simple APIs that don't look particularly Async. Read just returns a buffer (future). ReadStream returns a buffer stream future. They look like simple buffer->buffer APIs and you don't have to write special cases for all the various async chains, because it's easy for the client to chain things together as they please.

To be redundant, the win is that you can write a function like Compress() and you write it just like a synchronous buffer-to-buffer function, but its arguments can be futures and its return value can be a future.

Compress() should actually be a stackful coroutine, so that if the input buffer is a Stream buffer, then when you try to access bytes that aren't yet available in that buffer, you Yield the coroutine (pending on the stream filling).
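The stream-buffer interface that a coroutine like Compress() would consume might look like this (hypothetical, the names are mine; U8/SINTa as in the Oodle snippets elsewhere in these posts) :

#include <stddef.h>

typedef unsigned char U8;
typedef ptrdiff_t     SINTa;

struct OodleStreamBuffer
{
    virtual SINTa TotalSize() = 0;    // final size, known up front
    virtual SINTa BytesReady() = 0;   // grows as the producer completes

    // coroutine-yields (not a thread wait) until [pos,pos+count) is
    // available, then returns a pointer to it :
    virtual const U8 * WaitRange( SINTa pos, SINTa count ) = 0;
};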


Functions take futures as arguments and return futures.

Every function is actually run as a stackful coroutine on the worker threadpool.

Functions just look like synchronous code, but things like file IO cause a coroutine Yield rather than a thread Wait.

All objects are ref-counted and create automatic dependency chains.

All objects have built-in RW locks, arrays have RW locks on regions.

Parallelism is achieved through generic Stream and Branch/Merge facilities.

While this all sounds very nice in theory, I'm sure in practice it wouldn't work. What I've found is that every parallel routine I write requires new hacky special-casing to make it really run at full efficiency.

12-6-12 - The Oodle Dilemma

Top two Oodle (potential) customer complaints :

1. The API is too big, it's too complicated. There are too many ways of doing basically the same thing, and too many arguments and options on the functions.

2. I really want to be able to do X but I can't do it exactly the way I want with the API, can you add another interface?

Neither one is wrong.

11/23/2012

11-23-12 - Global State Considered Harmful

In code design, a frequent pattern is that of singleton state machines. eg. a module like "the log" or "memory allocation" which has various attributes you set up that affect its operation, and then subsequent calls are affected by those attributes. eg. things like :


Log_SetOutputFile( FILE * f );

then

Log_Printf( const char * fmt .... );

or :

malloc_setminimumalignment( 16 );

then

malloc( size_t size );

The goal of this kind of design is to make the common use API minimal, and to have a place to store the settings (in the singleton) so they don't have to be passed in all the time. So, eg. a call to Log_Printf() doesn't have to pass in all the options associated with logging; they are stored in global state.

I propose that global state like this is the classic mistake of improving the easy case. For small code bases with only one programmer, they are mostly okay. But in large code bases, with multi-threading, with chunks of code written independently and then combined, they are a disaster.

Let's look at the problems :

1. Multi-threading.

This is an obvious disaster and pretty much a nail in the coffin for global state. Say you have some code like :


pcb * previous_callback = malloc_setfailcallback( my_malloc_fail_callback );

void * ptr = malloc( big_size ); 

malloc_setfailcallback( previous_callback );

this is okay single threaded, but if other threads are using malloc, you just set the "failcallback" for them as well during that span. You've created a nasty race. And of course you have no idea whether the failcallback that you wanted is actually set when you call malloc because someone else might change it on another thread.

Now, an obvious solution is to make the state thread-local. That fixes the above snippet, but sometimes you want to change the state so that other threads are affected. So now you have to have thread-local versions and global versions of everything. This is a viable, but messy, solution. The full solution is :

There's a global version of all state variables. There are also thread-local copies of all the global state. The thread-local copies have a special value that means "inherit from global state". The initial value of all the thread-local state should be "inherit". All state-setting APIs must have a flag for whether they should set the global state or the thread-local state. Scoped thread-local state changes (such as the above example) need to restore the thread-local state to "inherit".
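For a single option the scheme looks something like this (a sketch using C++11 thread_local; names hypothetical) :

enum { OPT_INHERIT = -1 };

static int              s_global_verbosity = 1;             // the global state
static thread_local int t_verbosity        = OPT_INHERIT;   // per-thread override

int Log_GetVerbosity()
{
    int v = t_verbosity;
    return ( v == OPT_INHERIT ) ? s_global_verbosity : v;
}

void Log_SetVerbosity( int v, bool set_global )
{
    if ( set_global ) s_global_verbosity = v;   // seen by every "inherit" thread
    else              t_verbosity        = v;   // thread-local override
}

// note : scoped changes must restore to OPT_INHERIT, not to the old literal value.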

This can be made to work (I'm using it for the Log system in Oodle at the moment), but it really is a very large conceptual burden on the client code and I don't recommend it.

There's another way that these global-state singletons are horrible for multi-threading, and that's that they create dependencies between threads that are not obvious or intentional. A little utility function that just calls some simple functions silently picks up ties to shared variables, and now needs synchronization against everyone else who touches that global state. This is related to :

2. Non-local effects.

The global state makes the functions that use it non-"pure" in a very hidden way. It means that innocuous functions can break code that's very far away from them in hidden ways.

One of the classic disasters of global state is the x87 (FPU) control word. Say you have a function like :


void func1()
{

    set x87 CW

    do a bunch of math that relies on that CW

    func2();

    do more math that relies on CW

    restore CW
}

Even without threading problems (the x87 CW is thread-local under any normal OS), this code has nasty non-local effects.

Some branch of code way out in func2() might rely on the CW being in a certain state, or it might change the CW and that breaks func1().

You don't want to be able to break code very far away from you in a hidden way, which is what all global state does. Particularly in the multi-threaded world, you want to be able to detect pure functions at a glance, or if a function is not pure, you need to be able to see what it depends on.

3. Undocumented and un-asserted requirements.

Any code base with global state is just full of bugs waiting to happen.

Any 3d graphics programmer knows about the nightmare of the GPU state machine. To actually write robust GPU code, you have to check every single render state at the start of the function to ensure that it is set up the way you expect. Good code always expresses (and checks) its requirements, and global state makes that very hard.

This is a big problem even in a single-source code base, but even worse with multiple programmers, and a total disaster when trying to copy-paste code between different products.

Even something like taking a function that's called in one spot in the code and calling it in another spot can be a hidden bug if it relied on some global state that was set up in just the right way in that original spot. That's terrible; as much as possible, functions should be self-contained and work the same no matter where they are called. It's sort of like "movement of call site invariance symmetry" ; the action of a function should be determined only by its arguments (as much as possible) and any memory locations that it reads should be as clearly documented as possible.

4. Code sharing.

I believe that global state is part of what makes C code so hard to share.

If you take a code snippet that relies on some specific global state out of its context and paste it somewhere else, it no longer works. Part of the problem is that nobody documents or checks that the global state they need is set. But a bigger issue is :

If you take two chunks of code that work independently and just link them together, they might no longer work. If they share some global state, either intentionally or accidentally, and set it up differently, suddenly they are stomping on each other and breaking each other.

Obviously this occurs with anything in stdlib, or on the processor, or in the OS (for example there are lots of per-Process settings in Windows; eg. if you take some libraries that want a different timer period, or process priority class, or privilege level, etc. etc. you can break them just by putting them together).

Ideally this really should not be so. You should be able to link together separate libs and they should not break each other. Global state is very bad.


Okay, so we hate global state and want to avoid it. What can we do? I don't really have the answer to this because I've only recently come to this conclusion and don't have years of experience, which is what it takes to really make a good decision.

One option is the thread-local global state with inheritance and overrides as sketched above. There are some nice things about the thread-local-inherits-global method. One is that you do still have global state, so you can change the options somewhere and it affects all users. (eg. if you hit 'L' to toggle logging that can change the global state, and any thread or scope that hasn't explicitly set it picks up the global option immediately).

Other solutions :

1. Pass in everything :

When it's reasonable to do so, try to pass in the options rather than setting them on a singleton. This may make the client code uglier and longer to type at first, but is better down the road.

eg. rather than


malloc_set_alignment( 16 );

malloc( size );

you would do :

malloc_aligned( size , 16 );

One change I've made to Oodle is taking state out of the async systems and putting in the args for each launch. It used to be like :

OodleWork_SetKickImmediate( OodleKickImmediate_No );
OodleWork_SetPriority( OodlePriority_High );
OodleWork_Run( job );

and now it's :

OodleWork_Run( job , OodleKickImmediate_No, OodlePriority_High );

2. An options struct rather than lots of args.

I distinguish this from #3 because it's sort of a bridge between the two. In particular I think of an "options struct" as just plain values - it doesn't have to be cleaned up, it could be const or made with an initializer list. You just use this when the number of options is too large and if you frequently set up the options once and then use it many times.

So eg. the above would be :


OodleWorkOptions wopts = { OodleKickImmediate_No, OodlePriority_High  };
OodleWork_Run( job , &wopts );

Now I should emphasize that we already have given ourselves great power and clarity. The options struct could just be global, and then you have the standard mess with that. You could have it in the TLS so you have per-thread options. And then you could locally override even the thread-local options in some scope. Subroutines should take OodleWorkOptions as a parameter so the caller can control how things inside are run; otherwise you lose the ability to affect child code, which a global state system does give you.

Note also that options structs are dangerous for maintenance because of the C default initializer value of 0 and the fact that there's no warning for partially assigned structs. You can fix this by either making 0 mean "default" for every value, or making 0 mean "invalid" (and assert) - do not have 0 be a valid value which is anything but default. Another option is to require a magic number in the last value of the struct; unfortunately this is only caught at runtime, not compile time, which makes it ugly for a library. Because of that it may be best to only expose Set() functions for the struct and make the initializer list inaccessible.

The options struct can inherit values when it's created; eg. it might fill any non-explicitly given values (eg. the 0 default) by inheriting from global options. As long as you never store options (you just make them on the stack), and each frame tick you get back to a root for all threads that has no options on the stack, then global options percolate out at least once a frame. (so for example the 'L' key to toggle logging will affect all threads on the next frame).
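eg. the "0 means inherit" variant, with the struct resolved against the globals when it's used (a sketch; names hypothetical) :

struct WorkOptions           // 0 in any field means "inherit the default"
{
    int kickImmediate;       // 0 = inherit , else an explicit enum value
    int priority;            // 0 = inherit , else an explicit enum value
};

static int resolve( int opt, int inherited )
{
    return opt ? opt : inherited;
}

// WorkOptions wopts = { };  // all-zero init is always safe : it means "all defaults"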

3. An initialized state object that you pass around.

Rather than a global singleton for things like The Log or The Allocator, this idea is to completely remove the concept that there is only one of those.

Instead, Log or Allocator is a struct that is passed in, and must be used to do those options. eg. like :


void FunctionThatMightLogOrAllocate( Log * l, Allocator * a , int x , int y )
{
    if ( x )
    {
        Log_Printf( l , "some stuff" );
    }

    if ( y )
    {
        void * p = malloc( a , 32 );

        free( a , p );
    }
}

now you can set options on your object, which may be a per-thread object or it might be global, or it might even be unique to the scope.

This is very powerful, it lets you do things like make an "arena" allocator in a scope ; the arena is allocated from the parent allocator and passed to the child functions. eg :


void MakeSuffixTrie( Allocator * a , U8 * buf, int bufSize )
{

    Allocator_Arena arena( a , bufSize * 4 );

    MakeSuffixTrie_Sub( &arena, buf, bufSize );
}

The idea is there's no global state, everything is passed down.

At first the fact that you have to pass down a state pointer to use malloc seems like an excessive pain in the ass, but it has advantages. It makes it super clear in the signature of a function which subsystems it might use. You get no more surprises because you forgot that your Mat3::Invert function logs about degeneracy.

It's unclear to me whether this would be too much of a burden in real world large code bases like games.

11/13/2012

11-13-12 - Another Oodle Rewrite Note

Of course this is not going to happen. But in the imaginary world in which I rewrite from scratch :

I've got a million (actually several hundred) APIs that start an Async op. All of those APIs take a bunch of standard arguments that they all share, so they all look like :


OodleHandle Oodle_Read_Async(

                // function-specific args :
                OodleIOQFile file,void * memory,SINTa size,S64 position,

                // standard args on every _Async function :
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

The idea was that you pass in everything needed to start the op, and when it's returned you get a fully valid Handle which is enqueued to run.

What I should have done was make all the little _Async functions create an incomplete handle, and then have a standard function to start it. Something like :


// prep an async handle but don't start it :
OodleHandleStaging Oodle_Make_Read(
                OodleIOQFile file,void * memory,SINTa size,S64 position
                );

// standard function to run any op :
OodleHandle Oodle_Start( OodleHandleStaging handle,
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

it would remove a ton of boiler-plate out of all my functions, and make it a lot easier to add more standard args, or have different ways of firing off handles. It would also allow things like creating a bunch of "Staging" handles that aren't enqueued yet, and then firing them off all at once, or even just holding them un-fired for a while, etc.

It's sort of ugly to make clients call two functions to run an async op, but you can always get client code that looks just like the old way by doing :


OodleHandle h = Oodle_Start( Oodle_Make_Read( file, memory, size, position ) ,
                autoDelete, priority, dependencies, numDependencies );

and I could easily make macros that make that look like one function call.

Having that interval of a partially-constructed op would also let me add more attributes that you could set on the Staging handle before firing it off.

(for example : when I was testing compresses on enwik, some of the tasks could allocate something like 256MB each; it occurred to me that a robust task system should understand limiting the number of tasks that run at the same time if their usage of some resource exceeds the max. eg. for memory usage, if you know you have 2 GB free, don't run more than 8 of those 256 MB tasks at once, but you could run other non-memory-hungry tasks during that time. (I guess the general way to do that would be to make task groups and assign tasks to groups and then limit the number from a certain group that can run simultaneously))
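A sketch of that group limit (hypothetical; the pool skips tasks whose group is out of budget and re-checks when a slot frees up) :

#include <atomic>

// each group gets a budget of simultaneous runners :
struct TaskGroup
{
    std::atomic<int> budget;    // eg. 8 for the 256MB-each compress tasks
};

// worker takes a slot before running a task from the group :
static bool group_try_acquire( TaskGroup & g )
{
    int b = g.budget.load( std::memory_order_relaxed );
    while ( b > 0 )
        if ( g.budget.compare_exchange_weak( b, b - 1 ) )
            return true;        // got a slot
    return false;               // over budget; leave the task queued
}

static void group_release( TaskGroup & g )
{
    g.budget.fetch_add( 1 );    // and re-check the queue for blocked tasks
}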

11/08/2012

11-08-12 - Job System Task Types

I think I've written this before, but a bit more verbosely :

It's useful to have at least 3 classes of task in your task system :

1. "Normal" tasks that want to run on one of the worker threads. These take some amount of time, you don't want them to block other threads, they might have dependencies and create other jobs.

2. "IO" tasks. This is CPU work, but the main thing it does is wait on an IO and perhaps spawn another one. For example something like copying a file through a double-buffer is basically just wait on an IO op then do a tiny bit of math, then start another IO op. This should be run on the IO thread, because it avoids all thread switching and thread communication as it mediates the IO jobs.

(of course the same applies to other subsystems if you have them)

3. "Tiny" / "Run anywhere" tasks. These are tasks that take very little CPU time and should not be enqueued onto worker threads, because the time to wake up a thread for them dwarfs the time to just run them.

The only reason you would run these as async tasks at all is because they depend on something. Generally this is something trivial like "set a bool after this other job is done".

These tasks should be run immediately when they become ready-to-run on whatever thread made them rtr. So they might run on the IO thread, or a worker thread, or the main thread (any client thread). eg. if a tiny task is enqueued on the main thread and its dependencies are already done, it should just be run immediately.
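The dispatch rule for tinies is basically this (sketch; threadpool_enqueue stands in for whatever your pool's wake-a-worker call is) :

enum TaskClass { TASK_NORMAL , TASK_TINY };

struct Task
{
    TaskClass   cls;
    void     (* fn)( void * user );
    void      * user;
};

extern void threadpool_enqueue( Task * t );   // hypothetical pool API

// called on whatever thread completed the task's last dependency :
void task_became_ready( Task * t )
{
    if ( t->cls == TASK_TINY )
        t->fn( t->user );         // run right here; waking a worker costs more
    else
        threadpool_enqueue( t );  // real work : hand it to the pool
}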

It's possible that class #2 could be merged into class #3. That is, eliminate the IO-tasks (or GPU-tasks or whatever) and just call them all "tiny tasks". You might lose a tiny bit of efficiency from that, but the simplicity of having only two classes of tasks is probably preferable. If the IO tasks are made into just generic tiny tasks, then it's important that the IO thread be able to execute tiny tasks from the generic job system itself, otherwise it might go to sleep thinking there is no IO to be done, when a pending tiny IO task could create new IO work for it.

Okay.

Beyond that, for "normal" tasks there's the question of typical duration, which tells you whether it's worth it to fire up more threads.

eg. say you shoot 10 tasks at your thread-pool worker system. Should you wake up 1 thread and let it do all 10 ? Or wake up 10 threads and give each one a task? Or maybe wake 2?

One issue that still is bugging me is when you have a worker thread, and in doing some work it makes some more tasks ready-to-run. Should it fire up new worker threads to take those tasks, or should it just finish its task and then do them itself? You need two pieces of information : 1. are the new tasks significant enough to warrant a new thread? and 2. how close to the end of my current task am I? (eg. if I'm in the middle of some big work I might want to fire up a new thread even though the new RTR tasks are tiny).

When you have "tiny" and "normal" tasks at the same priority level, it's probably worth running all the tinies before any normals.

Good lord.

11/06/2012

11-06-12 - Using your own malloc with operator new

In Oodle, I use my own allocator, and I wish to be able to still construct & destruct classes. (Oodle's public API is C only, but I use C++ internally).

The traditional way to do this is to write your own "operator new" implementation which will link in place of the library implementation. This way sucks for various reasons. The important one to me is that it changes all the "new"s of any other statically-linked code, which is just not an okay thing for a library to do. You may want to have different mallocs for different purposes; the whole idea of a single global allocator is kind of broken in the modern world.

(the presence of global state in the C standard lib is part of what makes C code so hard to share. The entire C stdlib should be a passed-in vtable argument. Perhaps more on this in a later post.)

Anyway, what I want is a way to do a "new" without interfering with client code or other libraries. It's relatively straightforward (*), but there are a few little details that took me a while to get right, so here they are.

(* = ADDENDUM = not straightforward at all if multiple-inheritance is used and deletion can be done on arbitrary parts of the MI class)

//==================================================================

/*

subtlety : just calling placement new can be problematic; it's safer to make an explicit selection
of placement new.  This is how we call the constructor.

*/

enum EnumForPlacementNew { ePlacementNew };

// explicit access to placement new when there's ambiguity :
//  if there are *any* custom overrides to new() then placement new becomes ambiguous   
inline void* operator new   (size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
inline void  operator delete(void*,  EnumForPlacementNew, void*) { }

#ifdef __STRICT_ALIGNED
// subtlety : some stdlibs have a non-standard operator new with alignment (second arg is alignment)
//  note that the alignment is not getting passed to our malloc here, so you must ensure you are
//    getting it in some other way
inline void* operator new   (size_t , size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
#endif

// "MyNew" macro is how you do construction

/*

subtlety : trailing the arg list off the macro means we don't need to do this kind of nonsense :

    template <typename Entry,typename t_arg1,typename t_arg2,typename t_arg3,typename t_arg4,typename t_arg5,typename t_arg6,typename t_arg7,typename t_arg8,typename t_arg9>
    static inline Entry * construct(Entry * pEntry, t_arg1 arg1, t_arg2 arg2, t_arg3 arg3, t_arg4 arg4, t_arg5 arg5, t_arg6 arg6, t_arg7 arg7, t_arg8 arg8, t_arg9 arg9)

*/

//  Stuff * ptr = MyNew(Stuff) (constructor args); 
//  eg. for void args :
//  Stuff * ptr = MyNew(Stuff) ();
#define MyNew(t_type)   new (ePlacementNew, (t_type *) MyMalloc(sizeof(t_type)) ) t_type 

// call the destructor :
template <typename t_type>
static inline t_type * destruct(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    ptr->~t_type();
    return ptr;
}

// MyDelete is how you kill a class

/*

subtlety : I like to use a Free() which takes the size of the object.  This is a big optimization
for the allocator in some cases (or lets you not store the memory size in a header of the allocation).
*But* if you do this, you must ensure that you don't use sizeof() if the object is polymorphic.
Here I use MSVC's nice __has_virtual_destructor() extension to detect if a type is polymorphic.

*/

template <typename t_type>
void MyDeleteNonVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    #ifdef _MSC_VER
    RR_COMPILER_ASSERT( ! __has_virtual_destructor(t_type) );
    #endif
    destruct(ptr);
    MyFree_Sized((void *)ptr,sizeof(t_type));
}

template <typename t_type>
void MyDeleteVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    destruct(ptr);
    // can't use size :
    MyFree_NoSize((void *)ptr);
}

#ifdef _MSC_VER

// on MSVC , MyDelete can select the right call at compile time

template <typename t_type>
void MyDelete(t_type * ptr)
{
    if ( __has_virtual_destructor(t_type) )
    {
        MyDeleteVirtual(ptr);
    }
    else
    {
        MyDeleteNonVirtual(ptr);
    }
}

#else

// must be safe and use the polymorphic call :

#define MyDelete    MyDeleteVirtual

#endif

and the end result is that you can do :

    foo * f = MyNew(foo) ();

    MyDelete(f);

and you get normal construction and destruction but with your own allocator, and without polluting (or depending on) the global linker space. Yay.

11-06-12 - Protect Ad-Hoc Multi-Threading

Code with un-protected multi-threading is bad code just waiting to be a nightmare of a bug.

"ad-hoc" multi-threading refers to sharing of data across threads without an explicit sharing mechanism (such as a queue or a mutex). There's nothing wrong per-se with ad-hoc multi-threading, but too often people use it as an excuse for "comment moderated handoff" which is no good.

The point of this post is : protect your threading! Use name-changes and protection classes to make access lifetimes very explicit and compiler (or at least assert) moderated rather than comment-moderated.

Let's look at some examples to be super clear. Ad-Hoc multi-threading is something like this :


int shared;


thread0 :
{

shared = 7; // no atomics or protection or anything

// shared is now set up

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

print shared;

}

thread1 :
{

shared ++;

}

this code works (assuming that thread creation and waiting have some kind of memory barrier in them, which they usually do), but the hand-offs and synchronization are all ad-hoc and "comment moderated". This is terrible code.

I believe that even with something like a mutex, you should make the protection compiler-enforced, not comment-enforced.

Comment-enforced mutex protection is something like :


struct MyStruct s_data;
Mutex s_data_mutex;
// lock s_data_mutex before touching s_data

That's okay, but comment-enforced code is always brittle and bug-prone. Better is something like :

struct MyStruct s_data_needs_mutex;
Mutex s_data_mutex;
#define MYSTRUCT_SCOPE(name)    MUTEX_IN_SCOPE(s_data_mutex); MyStruct & name = s_data_needs_mutex;

assuming you have some kind of mutex-scoper class and macro. This makes it impossible to accidentally touch the protected stuff outside of a lock.

Even cleaner is to make a lock-scoper class that un-hides the data for you. Something like :

//-----------------------------------

template <typename t_data> class ThinLockProtectedHolder;

template <typename t_data>
class ThinLockProtected
{
public:

    ThinLockProtected() : m_lock(0), m_data() { }
    ~ThinLockProtected() { }    

protected:

    friend class ThinLockProtectedHolder<t_data>;
    
    OodleThinLock   m_lock;
    t_data          m_data;
};

template <typename t_data>
class ThinLockProtectedHolder
{
public:

    typedef ThinLockProtected<t_data>   t_protected;
    
    ThinLockProtectedHolder(t_protected * ptr) : m_protected(ptr) { OodleThinLock_Lock(&(m_protected->m_lock)); }
    ~ThinLockProtectedHolder() { OodleThinLock_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define TLP_SCOPE(t_data,ptr,data)  ThinLockProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//--------
/*

// use like :

    ThinLockProtected<int>  tlpi;
    {
    TLP_SCOPE(int,&tlpi,shared_int);
    shared_int = 7;
    }

*/

//-----------------------------------
Errkay.

So the point of this whole post is that even when you are just doing ad-hoc thread ownership, you should still use a robustness mechanism like this. For example by direct analogy you could use something like :

//=========================================================================

template <typename t_data> class AdHocProtectedHolder;

template <typename t_data>
class AdHocProtected
{
public:

    AdHocProtected() : 
    #ifdef RR_DO_ASSERTS
        m_lock(0), 
    #endif
        m_data() { }
    ~AdHocProtected() { }

protected:

    friend class AdHocProtectedHolder<t_data>;
    
    #ifdef RR_DO_ASSERTS
    U32             m_lock;
    #endif
    t_data          m_data;
};

#ifdef RR_DO_ASSERTS
void AdHoc_Lock( U32 * pb)  { U32 old = rrAtomicAddExchange32(pb,1); RR_ASSERT( old == 0 ); }
void AdHoc_Unlock(U32 * pb) { U32 old = rrAtomicAddExchange32(pb,-1); RR_ASSERT( old == 1 ); }
#else
#define AdHoc_Lock(xx)
#define AdHoc_Unlock(xx)
#endif


template <typename t_data>
class AdHocProtectedHolder
{
public:

    typedef AdHocProtected<t_data>  t_protected;
    
    AdHocProtectedHolder(t_protected * ptr) : m_protected(ptr) { AdHoc_Lock(&(m_protected->m_lock)); }
    ~AdHocProtectedHolder() { AdHoc_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define ADHOC_SCOPE(t_data,ptr,data)    AdHocProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//==================================================================
which provides scoped checked ownership of variable hand-offs without any explicit mutex.

We can now revisit our original example :


AdHocProtected<int> ahp_shared;


thread0 :
{

{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared = 7; // no atomics or protection or anything

// shared is now set up
}

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

{
ADHOC_SCOPE(int,&ahp_shared,shared);
print shared;
}

}

thread1 :
{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared ++;

}

And now we have code which is efficient, robust, and safe from accidents.
