6/21/2012

06-21-12 - Two Alternative Oodles

I want to write about two interesting ideas for async IO systems that I am *not* doing in Oodle.

1. The graph-forwarding automated parallelism system.

The idea here is that you make all your async operations be like flow chart widgets : they have "in" and "out" channels, and then you can draw links and hook outs up to ins.

This creates dependencies automatically : each op depends on everything that feeds its input channels.

So for example you might do :


"c:\junk"  :-> OpenFile :-> fh

(that is, OpenFile takes some string as input and then puts out a file handle named "fh")

fh       :->
buf      :->    ReadFile  :->   buf[0-32768]
0-32768  :->

(that is, make a ReadFile op that takes the file handle from the OpenFile, plus a destination buffer and byte range, and outputs a valid buffer range)

buf[0-32768]  :->
                    SPU_LZCompress  :->  lzSize , compbuf[0-lzSize]
compbuf       :->

(do an LZ compress on the valid buf range and output to compressed buf)

lzSize , compbuf[0-lzSize] :-> WriteFile

etc..

This sets up a chain of operations with dependencies; you can fire it off and then wait for the whole thing to complete. So, in a sense, it's nice because you don't have to write code that waits on the completion of each op and fires the next op and so on.

There are various ways you could set up the async chains. Obviously GUI tools where you drag lines around are popular for this sort of thing, but I think that's a terrible way to go. You could just have a text markup, or some creation functions that you call to build up the graph.

Another interesting option is to use the "just run the code" method. That is, you make proxy classes for all the variable types and do-nothing functions with the names of the ops (ReadFile, etc.); then you just run this fake imperative code, and all it does is record the calls and the arguments, and uses that to build the graph. That's easy enough to do for code without branches, but that's sort of a trivial case and I'm not sure how to make it work with branches. In fact in general this type of thing sucks bad for code with loops or branches.
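
To make the "just run the code" idea concrete, here's a minimal sketch of the recording trick (all the types and names here are made up for illustration, they are not Oodle API) : proxy objects stand in for the variables, and the "ops" do nothing but append nodes to a graph :

#include <vector>
#include <initializer_list>

struct GraphNode { const char * op; std::vector<GraphNode *> inputs; };

static std::vector<GraphNode *> g_graph;   // the recorded ops, in call order

// a Proxy stands in for any variable (file handle, buffer, ...) ;
// it just remembers which node produced it :
struct Proxy { GraphNode * node; };

static Proxy Record( const char * op, std::initializer_list<Proxy> ins )
{
    GraphNode * n = new GraphNode { op, {} };
    for ( Proxy p : ins ) n->inputs.push_back( p.node );
    g_graph.push_back( n );
    return Proxy { n };
}

// do-nothing functions with the names of the ops ; they only record :
static Proxy OpenFile( const char * name )   { (void) name; return Record( "OpenFile", {} ); }
static Proxy ReadFile( Proxy fh )            { return Record( "ReadFile", { fh } ); }
static Proxy SPU_LZCompress( Proxy buf )     { return Record( "SPU_LZCompress", { buf } ); }
static void  WriteFile( Proxy buf )          { Record( "WriteFile", { buf } ); }

// "running" the fake imperative code just builds the dependency graph :
static void BuildGraph()
{
    Proxy fh   = OpenFile( "c:\\junk" );
    Proxy buf  = ReadFile( fh );
    Proxy comp = SPU_LZCompress( buf );
    WriteFile( comp );
}

You'd then hand g_graph to the scheduler instead of actually executing anything during the fake run.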

Anyway, I think that this method is basically a terrible idea, except for one thing : creating the graph of the entire async operation before doing it can be a huge performance win. It allows you to "see the future" in terms of what the client wants to do, and thus make better scheduling decisions to maximize utilization of your available computation resources and disk at all times.

In the simplest case, if the client calls a huge Read and then a huge LZCompress after that, that's a dumb non-parallel way to do things, but in the normal imperative Oodle I can't do anything about it, because at the time I get the Read I don't know what's coming after it. If you gave me the graph, I could go, oh hey, during the Read I'm not using the CPU at all, and during the big LZCompress I'm not using the disk, so let me break those into smaller pieces and overlap them. Obviously you can schedule IOs around the timeline so that they are issued early enough for their results to be back when needed. (though even with the full async graph you can't really schedule right unless you know how long the CPU operations are going to take).
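
For example (just a sketch, with made-up helper names, not Oodle API), with the whole graph in hand the scheduler could rewrite that big Read + LZCompress into a chunked pipeline, so the disk is fetching chunk i+1 while the CPU compresses chunk i :

#include <cstddef>

// hypothetical async primitives, for illustration only :
struct AsyncHandle;
struct Buffer { void * data; size_t size; };

AsyncHandle * AsyncRead( const char * path, size_t offset, size_t size );
Buffer        WaitFor( AsyncHandle * h );
void          LZCompressChunk( const Buffer & chunk );

static const size_t kChunk = 256*1024;

void ReadAndCompressOverlapped( const char * path, size_t totalSize )
{
    AsyncHandle * pending = AsyncRead( path, 0, kChunk );      // kick off chunk 0

    for ( size_t offset = 0; offset < totalSize; offset += kChunk )
    {
        Buffer chunk = WaitFor( pending );                      // chunk i is in memory now

        if ( offset + kChunk < totalSize )
            pending = AsyncRead( path, offset + kChunk, kChunk ); // disk starts on chunk i+1 ...

        LZCompressChunk( chunk );                               // ... while the CPU compresses chunk i
    }
}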

There are even more subtle low level ways that not knowing the future gets me. In the whole worker thread system, there are crucial decisions like "should I wake up a new worker thread to take this work item, or wait for one of the existing worker threads to take it?" or "should I signal the worker threads as I make each item, or wait until I make them all?" ; or even just deciding which worker thread should take a task to maximize cache coherence. You cannot possibly get these decisions right without knowing the future.

Anyhoo, I don't think the advantage outweighs the horribleness of writing code this way, so on to the next one :

2. The coroutine auto-yielding system.

What if your file IO was always done from a coroutine, and instead of blocking when it didn't have the bytes needed, it just yielded the coroutine? That would give you fully async file IO (in the sense that you never block a thread just waiting on IO), and you could write code just like plain sync IO.

eg. you would just write :


// in a coroutine :

FILE * fp = fopen("c:\\junk","rb");  // yields the coroutine for the async open

int c = fgetc(fp); // might yield for a buffer fill

etc..

This is sort of appealing; it certainly makes it easy to write async IO code. I personally don't really love the fact that the thread yielding is totally hidden; I like major functional operations to be clear in the imperative code.

The real practical problem is that you just can't make this nice in the hacky switch-based C coroutine method that I use. You really need a language that supports coroutines natively.

You can cook up a clunky way of doing this with the C coroutines, something like :


#define cofread(fp,buf,size) \
    for(;;) \
    { \
        { \
        OodleAsyncHandle h = Oodle_ReadOrReturnAsyncHandle(fp,buf,size); \
        /* Oodle_ReadOrReturnAsyncHandle returns 0 if the Read could complete without waiting */ \
        if ( ! h ) \
            break; \
        \
        Coroutine_AddDependency(h); \
        } \
        COROUTINE_YIELD(); \
        \
        Coroutine_FlushDoneDependencies(); \
    }


where COROUTINE_YIELD is my macro that does something like :
  self->state = 7;
  return;
  case 7:
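
(for reference, the full switch-based macro set that cofread assumes - COROUTINE_START / COROUTINE_YIELD / COROUTINE_DONE, which also show up in the examples below - is roughly the standard protothreads-style thing; this is just a sketch of the idea, not the actual Oodle macros :)

struct coroutine { int state; };

// assumes "co" is the coroutine * argument of the enclosing function :
#define COROUTINE_START()   switch ( co->state ) { case 0:
#define COROUTINE_YIELD()   do { co->state = __LINE__; return; case __LINE__: ; } while(0)
#define COROUTINE_DONE()    } co->state = -1; return;

// (this sketch uses __LINE__ as the state id, so one yield per source line;
//  the version shown above just uses explicit numbers like 7)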

So now you can call cofread() from an Oodle coroutine and it kind of does what we want.

But, because we use switches and returns, we can't use any stack variables in the arguments to cofread ; eg :

{
    int size = 16384;

    cofread(fp,buf,size);
}
is no good; if you resume at the YIELD point inside cofread, "size" is gone. (you'd get a "case statement skips variable initialization" error or something like that).

Basically with the hacky C coroutine method you just can't do funny business like this where you hide the control flow; you have to make the yield points very explicit because they are points where you lose all your stack variables and must recreate them.

Perhaps a larger issue is that if you really were going to go with the full coroutine auto-yielding system, you'd want to be able to yield from inside function calls, not just at the root level of the coroutine. eg. you'd like to call functions that might do file IO or fire off worker tasks, and you want them to be able to yield too. That's not possible unless you have full stack-saving coroutines.

ADDENDUM for clarity :

It's totally trivial to fix the lack of stack saving in a limited way. All I have to do is reserve a few slots in the coroutine struct that cofread can use to store its variables. So cofread becomes :


#define cofread(fp,buf,size) \
    co->m_fp = fp; co->m_buf = buf; co->m_size = size; \
    for(;;) \
    { \
        { \
        OodleAsyncHandle h = Oodle_ReadOrReturnAsyncHandle(co->m_fp,co->m_buf,co->m_size); \
        /* Oodle_ReadOrReturnAsyncHandle returns 0 if the Read could complete without waiting */ \
        if ( ! h ) \
            break; \
        \
        Coroutine_AddDependency(h); \
        } \
        COROUTINE_YIELD(); \
        \
        Coroutine_FlushDoneDependencies(); \
    }

and now you really can use cofread within a coroutine, and you can use local variables as arguments to it, and it yields if it can't complete immediately, and that's all nice.

But it's *not* what I really want for this proposal, which is a full transparent system that a client can build their IO on. The problem is that cofread can only be called at the "root" level of a coroutine. That is, because the "yield" is not a true language yield that preserves function call stack, it must be in the base coroutine function.

eg. you can do :


void MyCoroutine1( coroutine * co )
{

COROUTINE_START()

    g_fp = cofopen("blah");

    cofread(g_fp,g_buf,1024);

COROUTINE_DONE()

}

That's easy. But you cannot do :

void MyHelper()
{
    g_fp = cofopen("blah");

    cofread(g_fp,g_buf,1024);
}


void MyCoroutine2( coroutine * co )
{

COROUTINE_START()

    MyHelper();

COROUTINE_DONE()

}

and that lack of composability makes it unusable as a general purpose way to do IO.

To be super clear and redundant again - Oodle of course does support and extensively uses coroutine IO, but that is for small tasks where I want maximum performance (like, eg. read and decompress a Package), where the limitation of having to write all your yielding code within one function is okay. The idea of proposal #2 is to make a system that is visible to the client and totally transparent, which they could use to write all their game IO.

(ASIDE : there is a way to do this in C++ in theory (but not in practice). What you do is do all your yielding at the coroutine root level still, either using the switch method or the lambda method (doesn't really matter). To do yields inside function calls, what you do is have your IO routines throw a DataNotReady exception. Then at the coroutine root level you catch that exception and yield. When you resume from the yield, you retry the function call and should make it further this time (but might throw again). To do this, all your functions must be fully rewindable, that is they should be exception safe, and should use classes that back out any uncommitted changes on destruction. I believe this makes the idea technically possible, but unusable in reality).
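
(to make that aside concrete, here's a sketch of the retry-loop shape, with made-up names - DataNotReady, DoClientIO, RootLevelYield - none of this is Oodle API :)

struct DataNotReady { };    // thrown by an IO routine, at any call depth,
                            // when the bytes it needs aren't ready yet

void DoClientIO();          // the client's synchronous-looking IO code ;
                            // must be fully rewindable / exception safe
void RootLevelYield();      // stands in for the root-level yield (switch or lambda method)

void CoroutineRoot()
{
    for(;;)
    {
        try
        {
            DoClientIO();       // rewound and retried from the top after every yield
            return;             // made it all the way through this time
        }
        catch ( const DataNotReady & )
        {
            RootLevelYield();   // wait for the async op to complete, then retry
        }
    }
}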

16 comments:

nereus said...

I've mentioned it before but c++11 lambdas make coroutines trivial and fast and code transparent. I've built a framework that implements "multitasking" among a bunch of tasks with yields using them and I think it's fabulous compared to how I learned them using protothreads.

ie

enum {YIELD, DONE};

std::function<int()> next_task_;

void task1_() {
next_task_ = [&]() { return task2_(); }; return YIELD; };

void task2_() { return DONE; }

public:
void init() {
next_task_ = [&]() { return task1_(); }
}

int invoke() { return next_task_(); }

Look ma, no threads!

EEK in the preview I see that the template angle brackets get elided. Just imagine the std::function correctly defined with an int return and void arg list.

nereus said...

oops upon code review we note that some voids need to be ints...

gotta run.

cbloom said...

I can't follow your code, but I believe you are smoking crack.

Lambdas let you write async code in the callback style, which is very ugly and undesirable.

You could perhaps use macros to make lambdas look like coroutines. It would be something like this :

lambda mycoroutine()
{
step1;
return []{

step2;
return []{

step3;
}}
}


Basically at each coroutine yield you are wrapping up the remainder of the coroutine as a lambda to get called when the coroutine resumes.

While that has the advantage over the switch hack that it allows local variable capture, it doesn't actually give you a true yield that can be done from inside function calls.

(I also don't see how you can yield inside branches or loops in a clean way with lambdas, which the switch method can; unless there's a solution that I don't see, the lambda way is actually worse than the switch way).

Note that lack of variable capture in the switch hack is the very *least* of its problems. I don't see how lambdas help with the real major issue, which is yielding from inside subroutines.

nereus said...

It's actually quite easy to write hierarchical state machines this way, and yield transparently from the lowest level. And at every level that [&] gives you not capture at the point of lambda definition, which is not what you want, but access to the full current state at lambda invocation.

I'm guessing by the "switch hack" you're referring to the macroized Duff's device that lurks behind the c technique, and yes indeed things get ugly pretty quickly.

If you can't read that code (I should have spent ten minutes writing a nice class, you're right, but frankly I'm more interested in my own code), then you haven't thought it through yet. But you're a smart guy. You will, you will.

One of the reasons I decided to go this way, and for which I will be eternally grateful to you, was reading all your posts on threading. I've been writing complicated threaded code for 15 years, and I thought maybe there was something I wasn't getting, because anything interesting eventually gets too complicated for me to understand, as an enumerated state machine. But nope, given the complexity of the model, threads turn out to be the greatest job protection technique ever invented.

Cheers!

cbloom said...

I think maybe we have a translation difficulty.

Anyhoo, to be clear there are two features of the idea #2 that I consider to be crucial :

1. It looks exactly like ordinary old synchronous code to the client.

2. You can flip a global switch and it *acts* exactly like old synchronous code, in the sense that I can step through it in the debugger and it just goes through line by line, not jumping all over the place in some crazy state machine / callback thing.


Sort of an aside, but I think the current fad of closures and continuations and functional programming is going to lead to a whole lot of horrible garbage code. Good code is linear and imperative.

won3d said...

Well, I prefer begin/end. The type system helps more than some bare int, and manipulating iterator-likes feels better to me than having the separate induction variable for loops since there is an implicit synchronization of state there.

won3d said...

I should say: for things like read/write, you're generally thinking "how big is this buffer" as opposed to arbitrary data ranges, so I'm not surprised (or particularly concerned) about this interface for I/O. But if your concern is about pointer-sized ints and stuff...

As an aside: favorite annoying thing for me lately has been the non-portability of size_t printf format strings.

won3d said...

...and like an idiot I posted to the wrong thread. This has not been a good Friday for me.

cbloom said...

" As an aside: favorite annoying thing for me lately has been the non-portability of size_t printf format strings. "

It's so fucking retarded, and it's one of those things where if you're a compiler writer and you don't make your stdlib match what the other major compilers do, you're just being a dick for no reason. (eg. gcc should obviously support the MSVC formats and vice versa).

This is also one of the little things that makes me dislike working in my RAD code. In my home code I just use autoprintf and I never have this problem.

won3d said...

Yeah, there should be at least a flag or something for GCC. But clearly MSVC should support C99's %zu.

jfb said...

Hi Charles,

What about just switching the stack pointer and backing up a couple registers?

I've been doing this with 0.5KB stacks on a microcontroller (each just a byte array). The switch routine consists of "push callee-saved registers and return pointer, set stack pointer to new stack, pop callee-saved registers, return to return pointer". Very fast and small, can still debug normally.

James

cbloom said...

@jfb Yeah, I've thought about that; I even wrote a post about it

http://cbloomrants.blogspot.com/2011/03/03-12-11-c-coroutines-with-stack.html

then deleted it because all you have to do is change compiler settings and it stops working. And I have to support N compilers on M platforms and it becomes a total disaster.

It would be an awesome project if someone wants to write stack-saving coroutines as an open source library and support all the flavors.

jfb said...

Hmm, I'm only doing it on one compiler (Clang). However it's only based on the platform calling convention so I don't see how a different compiler would break anything -- though I am doing stack switching, not the stack copying you described, so perhaps the issues are different.

When continuing a function I call the helper that switches to that function's stack is all. I can't see how the compiler would know the difference as long as the calling convention is followed?

cbloom said...

BTW I think you may be right; I've been putting off replying until I have some more moments to think about it clearly, but that time keeps staying in the future.

cbloom said...

@jfb - have looked into this a little bit. It's such a huge win for me that I think I will do it for Oodle 1.1 or 1.2 ; you're right that it's basically trivial, but with some minor annoyances :

1. You have to know how to make your compiler not do anything to fuck you up in the optimizer.

(god damn you C standards committee, stop fucking around with weirdo ideas like "concepts" and give me shit I actually need, like a standard way to say "don't optimize across this line" and a standard way to disable all optimization for a function, shit like that that real programmers need every day)

2. You have to know how to turn off any stack-checking or stack-extending type of stuff your compiler may add

I saw a good trick for working with systems that try to make sure your stack pointer is from the stack memory range. What you do is in your main thread you just alloca a huge array, and then use that to allocate your coroutine stacks from; that way they are valid stack pointers, they just happen to all be in the main thread stack.

3. On Windows you have to worry about SEH ; even if you don't intentionally use it, they check its integrity in system calls. It's just a few variables in the TEB/TIB to update, but probably best to just use Fibers on Windows and not try to beat the system.


I see that newer POSIX has this nice "context" stuff that should make this super easy in theory. Of course basically none of the platforms we work on has new POSIX.

(god damn you again C standards committee; give me a way to check for existence of a symbol, like ifexist(setcontext) )
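
For reference, a minimal sketch of what that POSIX context stuff looks like (assuming you have ucontext.h ; stack size and names picked arbitrarily) :

#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64*1024];          // the coroutine's own stack

static void co_func( void )
{
    // ... do some work ...
    swapcontext( &co_ctx, &main_ctx );  // yield : the whole call stack is preserved
    // ... resumes right here, locals intact, on the next swap in ...
}

int main( void )
{
    getcontext( &co_ctx );
    co_ctx.uc_stack.ss_sp   = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;               // where to go when co_func returns
    makecontext( &co_ctx, co_func, 0 );

    swapcontext( &main_ctx, &co_ctx );        // run the coroutine until its first yield
    swapcontext( &main_ctx, &co_ctx );        // resume it ; it finishes and links back
    return 0;
}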

Jesse James Lactin said...

I know I'm 5 years LTTP, but I'm considering doing just that (implementing stackful coroutines). It's something I want as part of a standard library (I'm currently replacing libm) after using [set][get]context under Linux (which is fucked, because its prototype is no longer standard C, let alone C++, so it's been deprecated).
