6/21/2012

06-21-12 - Two Alternative Oodles

I want to write about two interesting ideas for async IO systems that I am *not* doing in Oodle.

1. The graph-forwarding automated parallelism system.

The idea here is that you make all your async operations be like flow chart widgets, they have "in" and "out" channels, and then you can draw links and hook up ins to outs.

This creates dependencies automatically, each op depends on everything that feeds its input channels.

So for example you might do :


"c:\junk"  :-> OpenFile :-> fh

(that is, OpenFile takes some string as input and then puts out a file handle named "fh")

fh :->   ReadFile  :->   buf[0-32768]
buf :->
0-32768  :->

(that is, make a ReadFile op that takes the output of the Openfile, and outputs a valid buffer range)

buf[0-32768]  :->  SPU_LZCompress :-> lzSize , compbuf[0-lzSize]
compbuf

(do an LZ compress on the valid buf range and output to compressed buf)

lzSize , compbuf[0-lzSize] :-> WriteFile

etc..

This sets up a chain of operations with dependencies, you can fire it off and then wait on it all to complete. So, in a sense it's nice because you don't have to write code that waits on the completion of each op and fires the next op and so on.

There are various ways you could set up the async chains, obviously GUI tools where you drag lines around are popular for this sort of thing, but I think that's a terrible way to go. You could just have a text markup, or some creation functions that you call to build up the graph.
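To make the creation-functions option concrete, here's a toy dependency-graph builder. This is purely a sketch of the idea, not Oodle code; AsyncGraph, GraphNode, and the single-threaded Run loop are all made up here, and a real scheduler would be handing ready nodes to worker threads and the IO system instead of spinning on one thread :

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One node in the async graph : an op plus the nodes whose outputs feed it.
struct GraphNode {
    std::string name;
    std::function<void()> op;
    std::vector<GraphNode*> inputs; // dependencies
    bool done;
    GraphNode(std::string n, std::function<void()> o, std::vector<GraphNode*> in)
        : name(std::move(n)), op(std::move(o)), inputs(std::move(in)), done(false) {}
};

struct AsyncGraph {
    std::vector<GraphNode*> nodes;

    GraphNode* Add(std::string name, std::function<void()> op,
                   std::vector<GraphNode*> inputs = {}) {
        nodes.push_back(new GraphNode(std::move(name), std::move(op), std::move(inputs)));
        return nodes.back();
    }

    // Toy "scheduler" : run any node whose inputs are all done, repeat until
    // nothing new can run. Having the whole graph up front is what lets a
    // real version look ahead and overlap disk with CPU.
    void Run() {
        for (bool progress = true; progress; ) {
            progress = false;
            for (GraphNode* n : nodes) {
                if (n->done) continue;
                bool ready = true;
                for (GraphNode* in : n->inputs)
                    if (!in->done) ready = false;
                if (ready) { n->op(); n->done = true; progress = true; }
            }
        }
    }

    ~AsyncGraph() { for (GraphNode* n : nodes) delete n; }
};
```

Client code would build the OpenFile -> ReadFile -> Compress chain with a few Add calls and then fire off Run, never writing per-op completion handling.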

Another interesting option is to use the "just run the code" method. That is, you make proxy classes for all the variable types and do-nothing functions with the names of the ops (ReadFile, etc.); then you just run this fake imperative code, and all it does is record the calls and the arguments, and uses that to build the graph. That's easy enough to do for code without branches, but that's sort of a trivial case and I'm not sure how to make it work with branches. In fact in general this type of thing sucks bad for code with loops or branches.
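A tiny sketch of the record-the-calls trick (all names hypothetical) : the ops are do-nothing stubs that just log themselves, and threading a recorder through the proxy types is what captures the dependency chain :

```cpp
#include <string>
#include <vector>

// Recorded call log standing in for the graph being built.
struct GraphRecorder {
    std::vector<std::string> calls;
};

// Proxy types : they carry no real handle or buffer, just the recorder,
// so that each op's output can "feed" the next op.
struct ProxyFile { GraphRecorder* rec; };
struct ProxyBuf  { GraphRecorder* rec; };

// Do-nothing ops : each one records its name and arguments and returns a proxy.
ProxyFile OpenFile(GraphRecorder& rec, const char* path) {
    rec.calls.push_back(std::string("OpenFile ") + path);
    return { &rec };
}
ProxyBuf ReadFile(ProxyFile f, int size) {
    f.rec->calls.push_back("ReadFile " + std::to_string(size));
    return { f.rec };
}
ProxyBuf SPU_LZCompress(ProxyBuf b) {
    b.rec->calls.push_back("SPU_LZCompress");
    return { b.rec };
}
```

Running the "fake imperative code" once fills the log in dependency order; branches and loops are exactly where this scheme falls down, since the recording only sees the path actually taken.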

Anyway, I think that this method is basically a terrible idea, except for one thing : creating the graph of the entire async operation before doing it can be a huge performance win. It allows you to "see the future" in terms of what the client wants to do, and thus make better scheduling decisions to maximize utilization of your available computation resources and disk at all times.

In the simplest case, if the client calls a huge Read and then a huge LZCompress after that, that's a dumb non-parallel way to do things, but in the normal imperative Oodle I can't do anything about it, because at the time I get the Read I don't know what's coming after it. If you gave me the graph, I could go, oh hey during the Read I'm not using the CPU at all, and during the big LZCompress I'm not using the disk, so let me break those into smaller pieces and overlap them. Obviously you can schedule IO's around the timeline so that they try to be issued early enough so their results are back when needed. (though even with the full async graph you can't really schedule right unless you know how long the cpu operations are going to take).
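A toy cost model makes the overlap win concrete. Nothing here is Oodle; it's just the arithmetic of chunking a read-then-compress job so the disk and CPU run at the same time :

```cpp
#include <algorithm>

// A file split into nChunks; each chunk costs readCost on the disk and
// compCost on the CPU (arbitrary time units).

// Monolithic schedule : one big read, then one big compress, no overlap.
double MonolithicTime(int nChunks, double readCost, double compCost) {
    return nChunks * readCost + nChunks * compCost;
}

// Pipelined schedule : chunk i compresses while chunk i+1 reads; after the
// first read, the steady state is limited by the slower of the two resources.
double PipelinedTime(int nChunks, double readCost, double compCost) {
    return readCost + (nChunks - 1) * std::max(readCost, compCost) + compCost;
}
```

With 8 chunks of equal cost, the pipelined schedule takes 9 units instead of 16, approaching a 2x win as the chunk count grows; none of that is available unless the scheduler knows the compress is coming.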

There are even more subtle low level ways that not knowing the future gets me. In the whole worker thread system, there are crucial decisions like "should I wake up a new worker thread to take this work item, or wait for one of the existing worker threads to take it?" or "should I signal the worker threads as I make each item, or wait until I make them all?" ; or even just deciding which worker thread should take a task to maximize cache coherence. You cannot possibly get these decisions right without knowing the future.

Anyhoo, I don't think the advantage outweighs the horribleness of writing code this way, so on to the next one :

2. The coroutine auto-yielding system.

What if your file IO was always done from a coroutine, and instead of blocking when it didn't have the bytes needed, it just yielded the coroutine? That would give you fully async file IO (in the sense that you never block a thread just waiting on IO), and you could write code just like plain sync IO.

eg. you would just write :


// in a coroutine :

FILE * fp = fopen("c:\\junk","rb");  // yields the coroutine for the async open

int c = fgetc(fp); // might yield for a buffer fill

etc..

This is sort of appealing, it certainly makes it easy to write async IO code. I personally don't really love the fact that the thread yielding is totally hidden, I like major functional operations to be clear in the imperative code.

The real practical problem is that you just can't make this nice in the hacky switch-based C coroutine method that I use. You really need a language that supports coroutines natively.

You can cook up a clunky way of doing this with the C coroutines, something like :


#define cofread(fp,buf,size) \
    for(;;) \
    { \
        { \
        OodleAsyncHandle h = Oodle_ReadOrReturnAsyncHandle(fp,buf,size); \
        /* Oodle_ReadOrReturnAsyncHandle returns 0 if the Read could complete without waiting */ \
        if ( ! h ) \
            break; \
        \
        Coroutine_AddDependency(h); \
        } \
        COROUTINE_YIELD(); \
        \
        Coroutine_FlushDoneDependencies(); \
    }


where COROUTINE_YIELD is my macro that does something like :
  self->state = 7;
  return;
  case 7:

So now you can call cofread() from an Oodle coroutine and it kind of does what we want.

But, because of the fact that we use switches and returns, we can't use any stack variables in the arguments to cofread ; eg :

{
    int size = 16384;

    cofread(fp,buf,size);
}
is no good, if you resume at the YIELD point inside cofread, "size" is gone. (you'd get a case statement skips variable initialization error or something like that).

Basically with the hacky C coroutine method you just can't do funny business like this where you hide the control flow; you have to make the yield points very explicit because they are points where you lose all your stack variables and must recreate them.
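For reference, the whole hacky switch-based scheme fits in a few lines. This is a protothreads-style sketch with hand-numbered states (a real version would generate state numbers with __LINE__ or a counter), not the actual Oodle macros :

```cpp
// Minimal switch-based C coroutine. The coroutine's "program counter" lives
// in the struct, not on the stack, which is exactly why stack locals don't
// survive a yield.
struct coroutine { int state = 0; };

#define COROUTINE_START()   switch (co->state) { case 0:
#define COROUTINE_YIELD(n)  do { co->state = n; return false; case n:; } while (0)
#define COROUTINE_DONE()    } co->state = -1; return true;

// Example : a coroutine that yields twice before finishing.
// Returns true when complete, false when it yielded.
bool TwoStep(coroutine* co, int* runCount) {
    COROUTINE_START();
    ++*runCount;
    COROUTINE_YIELD(1);
    ++*runCount;
    COROUTINE_YIELD(2);
    COROUTINE_DONE();
}
```

Each call to TwoStep resumes at the case label recorded by the last yield; note that any local declared between yields would be skipped over by the switch, which is the "case statement skips variable initialization" problem above.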

Perhaps a larger issue is that if you really were going to go with the full coroutine auto-yielding system, you'd want to be able to yield from inside function calls, not just at the root level of the coroutine. eg. you'd like to call functions that might do file IO or fire off worker tasks, and you want them to be able to yield too. That's not possible unless you have full stack-saving coroutines.

ADDENDUM for clarity :

It's totally trivial to fix the lack of stack saving in a limited way. All I have to do is reserve a few slots in the coroutine struct that cofread can use to store its variables. So cofread becomes :


#define cofread(fp,buf,size) \
    co->m_fp = fp; co->m_buf = buf; co->m_size = size; \
    for(;;) \
    { \
        { \
        OodleAsyncHandle h = Oodle_ReadOrReturnAsyncHandle(co->m_fp,co->m_buf,co->m_size); \
        /* Oodle_ReadOrReturnAsyncHandle returns 0 if the Read could complete without waiting */ \
        if ( ! h ) \
            break; \
        \
        Coroutine_AddDependency(h); \
        } \
        COROUTINE_YIELD(); \
        \
        Coroutine_FlushDoneDependencies(); \
    }

and now you really can use cofread within a coroutine, and you can use local variables as arguments to it, and it yields if it can't complete immediately, and that's all nice.

But it's *not* what I really want for this proposal, which is a full transparent system that a client can build their IO on. The problem is that cofread can only be called at the "root" level of a coroutine. That is, because the "yield" is not a true language yield that preserves function call stack, it must be in the base coroutine function.

eg. you can do :


MyCoroutine1( coroutine * co )
{

COROUTINE_START()

   g_fp = cofopen("blah");  

    cofread(g_fp,g_buf,1024);

COROUTINE_DONE()

}

That's easy. But you cannot do :

void MyHelper()
{
   g_fp = cofopen("blah");  

    cofread(g_fp,g_buf,1024);
}


MyCoroutine2( coroutine * co )
{

COROUTINE_START()

    MyHelper();

COROUTINE_DONE()

}

and that lack of composability makes it unusable as a general purpose way to do IO.

To be super clear and redundant again - Oodle of course does support and extensively uses coroutine IO, but it is for small tasks that I want to have maximum performance (like, eg. read and decompress a Package), where the limitation of having to write all your yielding code within one function is okay. The idea of proposal #2 is to make a system that is visible to the client and totally transparent, which they could use to write all their game IO.

(ASIDE : there is a way to do this in C++ in theory (but not in practice). What you do is do all your yielding at the coroutine root level still, either using the switch method or the lambda method (doesn't really matter). To do yields inside function calls, what you do is have your IO routines throw a DataNotReady exception. Then at the coroutine root level you catch that exception and yield. When you resume from the yield, you retry the function call and should make it further this time (but might throw again). To do this, all your functions must be fully rewindable, that is they should be exception safe, and should use classes that back out any uncommitted changes on destruction. I believe this makes the idea technically possible, but unusable in reality).
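A minimal sketch of that exception-rewind idea (all names invented here) : the "IO" throws when data isn't ready, the coroutine root treats the catch as a yield, and resuming just re-runs the rewindable body :

```cpp
#include <stdexcept>

// Thrown from any call depth when the data isn't ready yet.
struct DataNotReady : std::runtime_error {
    DataNotReady() : std::runtime_error("data not ready") {}
};

// Pretend async read : not ready the first two times it's polled.
int ReadByte(int* polls) {
    if (++*polls < 3) throw DataNotReady();
    return 42;
}

// Helper that does "IO" - it can live anywhere in the call stack,
// because the throw unwinds all the way to the root.
int MyHelper(int* polls) { return ReadByte(polls); }

// Coroutine root : returns true when the body ran to completion,
// false when it "yielded". The body must be fully rewindable, since
// every retry re-executes it from the top.
bool RunBody(int* polls, int* result) {
    try {
        *result = MyHelper(polls);
        return true;
    } catch (const DataNotReady&) {
        return false; // yield : caller resumes us later
    }
}
```

The re-execution on every resume is also why this is unusable in practice : all side effects before the throwing call must be backed out or idempotent, which real IO code almost never is.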

6/19/2012

06-19-12 - Two Learnings

Two things that I got wrong in the past and have only recently come around to what I believe is the right way.

1. Use pointer-sized ints for memory buffer sizes.

When I made the transition to building cross-platform for both 32 and 64 bit I was really annoyed with the fact that I couldn't just use "int" everywhere any more. To make it easier for myself, I mostly just used int64 for memory buffers, and made a bunch of helpers for myself that returned int instead of size_t and intptr_t , eg. for things like vector.size() and strlen() and so on I used int-returning variants.

The nice thing about using int64 for memory buffer sizes is that your type doesn't change when you build different targets; that makes coding a lot simpler (and removes the possibility of bugs due to struct sizes changing and such).

Only very recently have I come around, and it's probably not for the reason you think. It's because using a separate type for "pointer sized thing" is a good way of documenting functions with types.

In the previous API post I talked about how I like the function arg types to be as self-documenting as possible. It's even better if you can make it a compile error or warning when you mis-use it. So I was looking at API's like :


Oodle_Read( OodleFile * f, void * buffer, S64 size, U64 filePos );

in particular at the call site where you have something like :

S64 i1,i2;

Oodle_Read(f,ptr,i1,i2);

you can't tell what the last two args are just by looking at the call site, and what's worse, you can switch them in the call order and get no warning. That sucks. It's much better to do :

Oodle_Read( OodleFile * f, void * buffer, SINTa size, U64 filePos );

(SINTa is the RAD pointer-sized signed int). It's now clearly documented in the variable types - this is the size of a memory buffer.
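For concreteness, here's one plausible spelling of that type scheme with <cstdint> ; the actual RAD definitions may differ, but the point is that the names document intent while the sizes fall out of the target :

```cpp
#include <cstdint>

// Pointer-sized ints : all memory buffer sizes and array sizes.
typedef intptr_t  SINTa;
typedef uintptr_t UINTa;

// 64-bit : file positions and file sizes, regardless of target.
typedef int64_t  S64;
typedef uint64_t U64;

// 32-bit : enums, modes, flags - not array sizes.
typedef int32_t  S32;
typedef uint32_t U32;
```

On a 32-bit build SINTa is 4 bytes and on a 64-bit build it's 8, so a function taking SINTa visibly says "this is a memory size" at every call site.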

If you have an old code base, it's very annoying at first to do this, because you'll have a lot of conversions to do. It only becomes clean once you have changed your whole code base to follow the rules :


SINTa/UINTa - all memory buffer sizes and array sizes

S64/U64 - file positions and sizes

S32/U32 - really not used for much, just enums and modes and flags, not array sizes

Not only does this put more documentation on the variable, it also makes it more clear when you are doing dangerous cast-downs.

eg. when you try to read a whole file into a memory buffer. You get the file size as an S64 and then you need to pass something to malloc, which takes an SINTa. That makes a clear single point where you are doing a questionable cast. Once you've done that one cast, the rest of the code is cast-free which is nice.

Furthermore, I think it's augmented by having helpers for the cast-downs :


U32 oo64to32(U64 x);  // used super rarely, very weird thing to do
U32 ooAto32(UINTa x); // used super rarely, very weird thing to do

UINTa oo64toA(U64 x); // used for file sizes -> memory buffers, somewhat common

There are only three cast-downs needed, and really the first two should almost never be used; if you are using them a lot it's a sign of bad code. The last one is common, and it should do a check to ensure the cast is okay (eg. if using a > 2GB file size on a 32 bit system).
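A sketch of what the checked cast-down might look like; oo64toA_checked is a made-up name, and asserting in debug while clamping in release is just one possible policy :

```cpp
#include <cassert>
#include <cstdint>

typedef uintptr_t UINTa; // pointer-sized unsigned (stand-in for the RAD type)

// Hypothetical checked cast-down in the spirit of oo64toA : fires an assert
// (and clamps) when the 64-bit size doesn't fit in a pointer-sized int,
// eg. a > 2GB file size on a 32-bit target.
UINTa oo64toA_checked(uint64_t x) {
    const uint64_t maxA = (uint64_t)(UINTa)-1;
    assert(x <= maxA);
    if (x > maxA) return (UINTa)maxA; // clamp in release builds
    return (UINTa)x;
}
```

Having all down-casts funnel through one helper like this means the questionable conversion happens at exactly one audited point, and the rest of the code stays cast-free.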

2. Avoid recursive mutexes.

Like many people, I read the wisdom of the ancients (old school threading programmers) who said that "recursive mutexes are of questionable value and lead to dangerous code; prefer non-recursive mutexes" ; I read that and I went "psshaw" and thought they were crotchety dinosaurs. Well, I've come around.

The thing about recursive mutexes is that like much code which is attractive to the novice, they make the trivial case simpler and look cleaner, but they make the hard case much worse, and the hard case is what actually matters.

The trivial case is : object only locks itself, never locks other objects, never can be freed while locked, etc.

But inevitably in real world threaded code you have to deal with the harder case for mutexes, which is : object might have to lock other objects (and they might have to lock it); lock order may be hard to establish; object may want to free itself while locked, etc.

The way that the novice normally makes a thread-safe object with a recursive mutex is to put a lock scoper in every function on that object :


Object::Func(...)
{
  m_mutex.Lock();

  do stuff;

  m_mutex.Unlock();
}

or really :

Object::Func(...)
{
  MUTEX_SCOPER(m_mutex);

  do stuff;
}

and crucially, when you call other member functions on yourself, that takes the mutex again recursively. (we're going to ignore the efficiency cost of taking the mutex many times). At first that way seems nice, it's easy, but then you encounter one of the real world problems and it falls apart.

What's better is to separate the mutex-taking from the actions, so instead you do :


Object::Func_Locked(...)
{
  do stuff;
}

Object::Func_Unlocked(...)
{
  MUTEX_SCOPER(m_mutex);

  Func_Locked();
}

Then all the functionality is in the _Locked functions and they can call each other.

So now you need to do something like call A->Func1 and B->Func2 , and it has to be done "atomically" , eg. with both locked. Then you run into the issue of mutex order of acquisition and possible deadlocks. If you have used the first style where objects lock themselves, then it is impossible for objects to call each other safely. That is, object A can never call object B, because object B might call object A and there's a deadlock. But with the _Locked functions, any _Locked function can call to another _Locked function, and they don't have to worry about deadlock; instead the lock taking is all pushed up and can be done in a standardized order (or even atomically if you have multi-mutex lock acquisition).
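Here's a compilable sketch of that style : the _Locked functions assume the caller holds the mutex, and a standardized address order (via std::less, since raw pointer < on unrelated objects is technically unspecified) lets a caller take two objects' locks without deadlock. std::lock is another way to get the multi-mutex acquisition :

```cpp
#include <functional>
#include <mutex>

struct Object {
    std::mutex m_mutex;
    int m_x = 0;

    // Caller must already hold m_mutex; _Locked functions may call each other.
    void Add_Locked(int v) { m_x += v; }

    // Takes the lock itself, then defers to the _Locked version.
    void Add_Unlocked(int v) {
        std::lock_guard<std::mutex> scoper(m_mutex);
        Add_Locked(v);
    }
};

// Atomically move v from a to b : lock both, in a standardized (address) order,
// so two threads doing Transfer(a,b,..) and Transfer(b,a,..) can't deadlock.
void Transfer(Object* a, Object* b, int v) {
    bool aFirst = std::less<Object*>()(a, b);
    Object* first  = aFirst ? a : b;
    Object* second = aFirst ? b : a;
    std::lock_guard<std::mutex> l1(first->m_mutex);
    std::lock_guard<std::mutex> l2(second->m_mutex);
    a->Add_Locked(-v);
    b->Add_Locked(+v);
}
```

With the recursive self-locking style, Transfer can't be written at all : each Add would re-take its object's mutex in whatever order the calls happen to occur.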

The other thing that non-recursive locks cleans up, is that you know whenever you see a call to Unlock(), that's the *last* call to unlock and it definitely makes the object publicly visible. That is, there is a crucial transition point from "I own this object" to "others can have it", and with the recursive lock method, that transition point is muddied.

For example, consider this simple and common case : something in the object's internal state tells you it should be deleted. You must do that atomically, because if you unlock then delete, the internal state might change, eg. :


Object::Func()
{
  m_mutex.lock();

  if ( m_x > m_y )
     deleteMe = true;

  m_mutex.unlock();

  // *!*

  if ( deleteMe )
    delete this;
}

That code is no good, at the *!* point, some other thread might touch the object and make the check m_x > m_y no longer be true, and then we would be deleting this object incorrectly, when the invariant is not set. In order to make this work you need a combined "unlock_and_delete" operation. But to do that you need to know that your unlock is the *last* unlock - you can't know that in the recursive mutex style.

Basically the recursive mutex is sort of like optimizing the code that's already fast; it makes the simple case a bit simpler, but it makes the real world hard cases much worse. Better to just start from the beginning with the more robust long term solution.

Once you realize that the advantage of a recursive mutex is an illusion, then the advantages of the non-recursive mutex become appealing. You can implement a non-recursive mutex far more efficiently. It can take only 1 bit of memory. Every object can easily have its own non-recursive mutex and they don't even need to be allocated or initialized at all.
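For example, a non-recursive mutex can be sketched as a one-flag spinlock. This is a toy (a real one would back off to an OS wait instead of spinning), but it shows how small the state gets - no owner thread id, no recursion count :

```cpp
#include <atomic>

// A non-recursive mutex really can be (logically) one bit : here, an atomic
// flag. Trying to Lock recursively from the same thread just deadlocks,
// which is the price of the tiny footprint.
struct TinyMutex {
    std::atomic_flag m_locked;

    TinyMutex() { m_locked.clear(); }

    void Lock() {
        while (m_locked.test_and_set(std::memory_order_acquire)) {
            /* spin; a real lock would yield or futex-wait here */
        }
    }
    void Unlock() {
        m_locked.clear(std::memory_order_release);
    }
};
```

Compare a recursive mutex, which must track an owner id and a count, and so needs per-object storage and initialization that this one doesn't.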

6/18/2012

06-18-12 - Run Logged

A handy batch file :

runlogged.bat :

set args=%*
if "%args%"=="" end.bat
set Now=%TIME%
REM ECHO It's %Now% now
REM remove decimal part of time :
FOR /F "tokens=1 delims=." %%A IN ("%Now%") DO SET Now=%%A
REM ECHO It's %Now% now
REM alternate way :
REM FOR /F %%A IN ('TIME/T') DO SET Now=%%A
REM ECHO It's %Now% now
set str=%DATE%_%Now%_%args%
call make_clean_str %str%
set str=%str:.=_%
if not exist r:\runlogs call md r:\runlogs
set logname=r:\runlogs\%str%.log
call %args% > %logname% 2>&1
call d %logname%
type %logname%

make_clean_str.bat :

set str=%*
set str=%str: =_%
set str=%str::=_%
set str=%str:\=_%
set str=%str:/=_%
set str=%str:;=_%

make_clean_str is handy any time you want to make a string into a file name, it gets rid of illegal characters.

runlogged runs a command line and saves an archive of all runs, with the arg list and the date/time in the file name so they never collide.

Thanks to these two :

vanderwoude Batch files
ss64 Windows CMD Command Syntax

If the internet wasn't such a useless pile of turd, I would completely download both of their sites for my reference library.

I've written before, but repeating : what I really want is a fully journalling OS; everything I ever run should be fully tracked and completely reproducible; all the input & output files should be saved. Oh well, this is better than nothing.

6/14/2012

06-14-12 - API Design with Auto-Complete

I'm working on tweaking the Oodle API and a few things strike me.

One is that my personal preference for API's has been heavily influenced by my reliance on Browse Info tips and auto complete. Some examples :

1. I like really long function names. I want the function name to tell me as much as possible about the function. If the function has some weird side effect that is not right there in the name, I consider that a failure of the API. I like function names like "OodleLZ_OpenFileReadHeaderDecompressAsync" ; it tells me exactly what it does. I don't care how long it is because after I type a few characters I control-space and the rest comes out.

Package_WriteFile_ThenDeleteSource(); 

that's okay

Package_WriteFile(); // also deletes source

WTF crazy unexpected side effect that's not in the name.  Not okay.

Basically any time I have to go to the header or the docs to figure out what a function does, I consider the API a failure. (of course the biggest failure is when I'm using my own code and can't figure out exactly what a function does without going and reading the implementation)

2. Related to that, I like file names and function names to be alphabetically front-loaded. That is, I want the first few characters to distinguish the function as much as possible, because I really want to do a few characters and then be able to ctrl-space it. I hate the fact that I have to start all my function names with Oodle_ because it makes for a lot of typing before I can ctrl-space it. In my ideal world, the first few letters are a summary of the action; like in the previous example I'd rather see "LZDecIO_OpenFileReadHeaderDecompressAsync".

3. For function arguments, I of course rely on the browse info to tell me what the args are. Because I can see the types of the args and their names, I want the types and names to be super descriptive. A well designed function does not need any docs at all, the names of the args and the name of the function tell you everything about it.

3.A. It's very easy to distinguish "in" and "out" args by use of const, when the auto-complete tells you the type of the args. Furthermore in C++ I would always make "in" args be references and "out" args be pointers. Basically anything you can put in the types of the variables is documentation that is visible to me right there when I use the function.

3.B. Because of this I sort of prefer a mess of bool args to enums (*) or flags. (* = an enum just to give the bool names is nice). If I start typing a function and the browse info pops up and tells me the args are

 EnumDir(const char * dirPath,bool recurseDirs,bool caseSensitive,bool filesOnlyNoDirs) 
then I can see that as I type and I know exactly what it does, But if I start typing and the browse info pops up
 EnumDir(const char * dirPath,U32 flags) 
that sucks because now I have to go to the header and try to figure out what the flags are. Basically any API where you have to go read the docs or scan the header, I don't like. Unfortunately the bool way results in really ugly code after you've written it :
  EnumDir(dirPath,true,false,false) 
you have no idea WTF that does, so that sucks. There is a special type of enum which I believe gives the best of both worlds. First of all, the bad type of enum is one that you can't tell what values it has unless you go look at the docs or the header, so like, this is bad :
 EnumDir(const char * dirPath,enum EnumDirOptions options) 
that sucks; but if you just use the enums as a way of naming bools, like :
 EnumDir(const char * dirPath,enum RecurseDirs recurseDirs,enum CaseSensitive caseSensitive,enum FilesOnly filesOnlyNoDirs);

EnumDir(dirPath,RecurseDirs_Yes,CaseSensitive_No,FilesOnly_No); 
then I think you have the best of both worlds, in the sense that reading the function after it's written is totally clear, and you can write the function (with browse info) without looking at docs or headers. This only works if all your enums are reliably just _Yes _No simple enums, if you try to be fancier and make the names like "RecurseDirs_Recursive" or whatever custom non-standard names, it makes it unpredictable.
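The pattern compiles down to something like this; EnumDir here is just a stub to show that swapped enum args are a compile error while the call site stays readable :

```cpp
#include <string>

// Enums used purely as named bools : reliably _No / _Yes, nothing fancier.
enum RecurseDirs   { RecurseDirs_No,   RecurseDirs_Yes };
enum CaseSensitive { CaseSensitive_No, CaseSensitive_Yes };
enum FilesOnly     { FilesOnly_No,     FilesOnly_Yes };

// Stub : returns a fake "entry count" just so there's something to check.
int EnumDir(const std::string& dirPath, RecurseDirs recurseDirs,
            CaseSensitive caseSensitive, FilesOnly filesOnlyNoDirs) {
    (void)caseSensitive; (void)filesOnlyNoDirs;
    int n = (int)dirPath.size();
    if (recurseDirs == RecurseDirs_Yes) n *= 2;
    return n;
}
```

A call like EnumDir(dirPath, CaseSensitive_No, RecurseDirs_Yes, FilesOnly_No) fails to compile because distinct enum types don't implicitly convert to each other, which is exactly the protection a mess of raw bools doesn't give you.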

Okay, now we'll get into some issues that aren't so browse-info related.

4. I've always believed that API's should be obvious about correct or incorrect usage. They should be "wysiwyg" in the sense of, if you read the code written to the API, it should do what intuitive English reading suggests it does. In particular, if it looks right, it should *be* right. There are some ways you can fuck this up in API design :

4.A. Near-synonyms that aren't obvious. Somebody writes code like :

  Widget * w = Widget_Create();
  ...
  Widget_Delete(w);
it's fine, right? Nope. They should have used Widget_Destroy(w) there, Widget_Delete is for something else. That's very bad API, you have near-synonyms that seem to be interchangeable, but aren't, and it leads to code that reads like it should be fine but isn't.

4.B. One of the most insidious forms of this is API's that will still work if you use them wrong, but fall into super-low-performance mode. Of course most of us are familiar with the terrible 3d graphics API's where just setting some seemingly innocuous flag suddenly makes all your speed go away (D3D has perhaps gotten a bit better about this in recent years), but it's fucking ridiculous, I shouldn't have to profile my code every time I change a flag on CreateVertexBuffer or spend years learning about the hacks. The fast paths and slow paths should be clearly separated and if I do something slow when using the fast functions, it should be a failure. It's much worse to give it a slow fallback than to just make it fail.

A similar case is when API's have a lot of unclear dependencies to get on the fast path. Trying to think of an example, one that comes to mind is stuff like the Win32 "overlapped" async file stuff. ReadFile can be async or not async depending on whether you pass an overlapped structure (bad - I hate API's that massively change their action in an unclear way), if it is async then the requirements on the buffer and size and such are totally different (must be sector aligned), and both the sync and async act on the exact same object (HANDLE from CreateFile) so that when you are given a file handle you have no idea whether it is a valid async IO handle or not. Similarly when you CreateFile it's totally not obvious how to make sure you are making one that's valid for async IO.

Any time you are thinking about writing docs like "if you call this function and then this one, and if you pass just the right flags and only use them in this specific way, then the entire behavior of the thing changes and does this other thing". WTF? No. That's bad. Make that functionality a separate set of functions.

(Aside whiney rant : sometimes I feel like I'm the only person in the world who has this philosophy : When somebody uses my API wrong, I consider that not their failing, but my failing. What did I do wrong in the API that made them think that was the correct usage? How could I add more compile-time clarity to make incorrect usage be a compile failure? I consider it the API's responsibility to make it as easy as possible to use correctly, and as hard as possible to use incorrectly. Whenever I use someone's API wrong and I ask "why is this not doing what I expect?", I generally get the response of "duh you're using wrong". When someone hands you a knife with the blade pointed towards you, it's kind of their fault when you cut your hand.)

((though I guess that's just a sub-point of an even bigger whiney rant, which is that I often feel like I'm the only person in the world who actually cares about their own code or takes pride in it. If somebody finds a bug in my code, or points out a flaw, or even tells me that something could be done better, then I want to know, I want to fix it. I believe my code is great and if there is any reason to suspect it is not, I want to remedy it. When I point out bugs or bad algorithms in others' code the response I usually get is "meh". And on a broader level, when people ask about difficult algorithmic questions I'll usually point them at some reference pages or some academic papers, and I sort of suspect that nobody that I've referred things to has actually ever read them; I guess as any professor knows, you have to be satisfied with something like a 1% hit rate. When a smart kid comes to you after class and says "I'd like to learn more about interpretations of quantum mechanics" and you are excited that a kid is actually interested in learning, and you say "sure, I'd love to talk to you about that, here's a good introductory paper to get you started, then come and see me at office hours", chances are you'll never see that kid again)).

When someone uses my API badly it reflects badly on my code. A lot of API's have the very bad design property that using them wrong is much easier than using them right. If people use your API and never check error codes, maybe it's your fault because you made it too hard to check error codes. Correct usage should be as automatic as possible.


Designing API's for yourself, or even your own company, is very different than tossing API's out into the void. When it's for a small group, a certain amount of quirkiness is okay, because you learn the quirks and then they aren't a problem. In fact, as many game companies sadly know, it's much more valuable to leave the quirky code alone, because it's better to have something familiar than to "fix it" and make something that is cleaner but unfamiliar.

A certain amount of fragility is also okay in internal API's, because if you use it wrong then you hit an assert or a crash, and you just fix your usage. In a public API that's not so okay because the crash is down in the library and it's not their code.

The biggest difference is that internal API's can force a use pattern. You can make an API that assumes a certain model for things like memory allocation and lifetime management and such; you can say "use it this way and if you try to use it other ways it won't work". With a public API you can't do that, it has to work decently any way it's used, and different people may have very different ideas.

At RAD an issue I've never really dealt with before is that I have to be careful to design the API so it's good for *me* as well as good for the customers. Normally I would just try to make the API as easy and good to use as possible. In particular, something that I often do in my own code like cblib is to wrap up lots of helper functions that do all the common stuff for you. Personally I really hate "minimal" API's where you have to write the same bunch of code over and over to use them. If I have to use it in a certain way, then just fucking wrap that up for me and put that in a helper.

But there are disadvantages to that for the API maintainer. For one thing, just the bigger the API is, the more work it is, you have to document every entry point, you have to stress test each entry point, and then once it's added you can't remove it without causing chaos.

The other related issue is API's that are dangerous or hard to use. There are some things where bugs in usage are so likely that it's not worth exposing them.

The most obvious example for me in Oodle is all the low level lock free stuff. I personally think there's immense value in that stuff, but it's highly unlikely that we make sales based on it, and it's very hard to use and very easy to get weird crashes with incorrect use. Because of that, it won't be exposed (in the first release anyway), and we're trying to make the visibility to it only through safe API's.

Another example is exposing access to internal structures. It's slightly more efficient to be able to access the Oodle system structs directly, to read or change values, but that's also much more risky. No operating system ever allows that, for example. The safer way is to provide accessors that change the values safely. For example on Win32, any time you get an OS struct, you actually have a copy of the OS structure, not a direct pointer; so first you have to copy the struct out to your own, fiddle with it, then copy it back. This is basically the way I'm trying to go with Oodle in the first rev.

6/13/2012

06-13-12 - MSVC RegExp Find-Replace

This took me a while to figure out so I thought I'd write it down. You can use the MSVC regexp find-rep to match function args and pass them through. This is in VC2005 so YMMV with other versions (yay for fucking retards who randomly change interfaces for very little benefit and throw away huge amounts of value in learned knowledge and familiarity! yay!)

In particular I wanted to change :


Oodle_Wait(blah,true);

->

Oodle_Wait(blah,OodleAsyncHandle_DeleteIfDone)

for any "blah". The way to do this is :

Oodle_Wait\(:b*{.@}:b*,:b*true:b*\)

->

Oodle_Wait(\1,OodleAsyncHandle_DeleteIfDone)

What it means :

\(  is escaped ( character
:b* matches any amount of white space
{}  wrap an expression which we can later refer to with \1,\2,etc.
.@  means match any character, and @ means match as few as possible (don't use .*)

A few things that tripped me up :

1. The replace string is *not* a regexp ; in particular notice there's no \ for the parens; I had \ on the parens in the output and the damn thing just silently refused to do the replace. So that's another hint - if you click "Find" and it works, and you click "Replace" and it just silently does nothing, that might mean that it doesn't like your output string.

2. There's a ":i" special tag that matches a C identifier. (:i is equal to ([a-zA-Z_$][a-zA-Z0-9_$]*) ) You might think that :i is a nice way to match a function argument, but it's not. It only works if the function argument is a simple identifier, it won't match "array[3]" or "obj.member" or anything like that. It would have been nice if they provided a :I or something that matched a complex identifier.

In cblib/chsh, I could have just done


Oodle_Wait`(*,true`)

->

Oodle_Wait`(*,OodleAsyncHandle_DeleteIfDone`)

which, while not remotely as powerful as a full regexp match, I find much more intuitive and easy to use, and it works for 99% of the find-reps that I want.

(in cblib a * in the search string always matches the minimum number of characters, and a * in the replace string means put the chars matched in the search string at the same slot)
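
That minimal-match wildcard is simple enough to sketch in a page of C++. This is *not* the actual cblib code (the names wild_match_at / wild_replace are made up), just a demonstration of the idea: each * captures the minimum run of characters before the next literal segment, and the i'th * in the replace string emits the i'th capture :

```cpp
#include <string>
#include <vector>

// Try to match pattern 'pat' (literals plus '*' wildcards) at position
// 'pos' of 's'.  Each '*' captures the *minimum* run of characters :
// we stop at the earliest occurrence of the literal segment that follows.
static bool wild_match_at(const std::string & s, size_t pos,
                          const std::string & pat,
                          size_t & end, std::vector<std::string> & caps)
{
    size_t si = pos, pi = 0;
    caps.clear();
    while ( pi < pat.size() )
    {
        if ( pat[pi] == '*' )
        {
            ++pi;
            // literal segment that must follow this star :
            size_t lit_end = pat.find('*', pi);
            std::string lit = pat.substr(pi,
                lit_end == std::string::npos ? std::string::npos : lit_end - pi);
            // minimum match : stop at the earliest occurrence of lit
            size_t hit = lit.empty() ? si : s.find(lit, si);
            if ( hit == std::string::npos ) return false;
            caps.push_back( s.substr(si, hit - si) );
            si = hit + lit.size();
            pi += lit.size();
        }
        else
        {
            if ( si >= s.size() || s[si] != pat[pi] ) return false;
            ++si; ++pi;
        }
    }
    end = si;
    return true;
}

// Replace the first match of 'pat' in 's' ; each '*' in 'rep' emits the
// corresponding captured run from the search pattern.
std::string wild_replace(const std::string & s,
                         const std::string & pat,
                         const std::string & rep)
{
    for ( size_t pos = 0; pos < s.size(); ++pos )
    {
        size_t end; std::vector<std::string> caps;
        if ( wild_match_at(s, pos, pat, end, caps) )
        {
            std::string out = s.substr(0, pos);
            size_t cap = 0;
            for ( char c : rep )
                out += ( c == '*' && cap < caps.size() )
                     ? caps[cap++] : std::string(1, c);
            out += s.substr(end);
            return out;
        }
    }
    return s; // no match, unchanged
}
```

(I've used bare parens here instead of cblib's backtick escapes, which doesn't matter for the sketch.)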

MSVC supports a similar kind of simple wild match for searching, but it doesn't seem to support replacing in the simple wild mode, which is too bad.


I'm doing a ton of Find-Replacing trying to clean up the Oodle public API, and it has made it clear to me how fucking awful the find-replace in most of our code editors is.

I wrote before about how "keep case" is an obvious feature that you should have for code find-replace. But there's so much more that you should expect from your find-rep. For example :

1. I frequently want to do things like rename "VFS_" to "OodleVFS_" , but only if it occurs at the beginning of a word (and of course with keep-case as well). So "only at head of word" or "only at tail of word" would be nice.

2. All modern code editors have syntax parsing so they know if words are types, variable names, comments, etc. I should be able to say do this find-replace but only apply it to function names.

An extremely simple "duh" check-box on any find-replace should be "search code" and "search comments". A lot of the time I want to do a find-rep only on code and not comments.

An even more sophisticated type-aware find-rep would let you do things like :

enum MyEnum
{
  red,black
};

I want to find-rep "red" and make it "Oodle_MyEnum_Red" , but only where the word "red" is being used in a variable of type MyEnum.

That sounds like rather a lot to expect of your find-rep but by god no it is not. The computer knows how to do it; if it can compile the code it can do that find-rep very easily. What's outrageous is that a human being has to do it.

3. A very common annoyance for me is accidental repeated find-reps. That is, I'll do something like find-rep "eLZH_" to "OodleLZH_" , but if I accidentally do it twice I get "OodlOodleLZH_" which is something I didn't expect. Almost always when doing these kind of big find-reps, once I fix a word it's done, so these problems could be avoided by having an option to exclude any match which has already been modified in the current find-rep session.

4. Obviously it should have a check box for "ignore whitespace that doesn't affect C". I shouldn't have to use regexp to mark up every spot where there could be benign whitespace in an expression. eg. if I search for "dumb(world)" and ignore C whitespace it should find "dumb ( world )" but not "du mb(world)".
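
One way such an "ignore C whitespace" search could work (hypothetical — no editor I know of exposes this; the names ws_insensitive_pattern / ws_insensitive_find are made up) is to expand the literal into a regex that permits whitespace *between* tokens but not inside identifiers :

```cpp
#include <cctype>
#include <regex>
#include <string>

// Escape a single char for ECMAScript regex if it's a metacharacter.
static std::string escape_regex(char c)
{
    static const std::string meta = R"(\^$.|?*+()[]{})";
    if ( meta.find(c) != std::string::npos ) return std::string("\\") + c;
    return std::string(1, c);
}

// Turn a literal like "dumb(world)" into "dumb\s*\(\s*world\s*\)" :
// whitespace is allowed between tokens, but identifier runs must be
// contiguous, so "dumb ( world )" matches and "du mb(world)" does not.
std::string ws_insensitive_pattern(const std::string & lit)
{
    std::string pat;
    size_t i = 0;
    bool first = true;
    while ( i < lit.size() )
    {
        if ( isspace((unsigned char)lit[i]) ) { ++i; continue; }
        if ( !first ) pat += R"(\s*)";
        first = false;
        if ( isalnum((unsigned char)lit[i]) || lit[i] == '_' )
        {
            // identifier/number run : no gaps allowed inside
            while ( i < lit.size() &&
                    (isalnum((unsigned char)lit[i]) || lit[i] == '_') )
                pat += lit[i++];
        }
        else
        {
            pat += escape_regex(lit[i++]);
        }
    }
    return pat;
}

bool ws_insensitive_find(const std::string & text, const std::string & lit)
{
    return std::regex_search(text, std::regex(ws_insensitive_pattern(lit)));
}
```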

etc. I'm sure if we could wipe out our preconceptions about how fucking lame the find-rep is, lots of ideas would come to mind about what it obviously should be able to do.

I see there are a bunch of commercial "Refactoring" tools (what a retarded buzz word that is; it's just cleaning up code) that might do these types of things for you. In my experience those tools tend to be ungodly slow and flaky; part of the problem is they try to incrementally maintain a browse info database, and they always fuck it up. The compiler is plenty fast and I know it gets it right.

6/12/2012

06-12-12 - Another Threading Post Index

Maybe if I used post category tags I wouldn't have to do these by hand. Oh well. Starting from the last index :

(chronological order, so the best stuff is at the end, and early stuff may be corrected in later posts)

2009-02-26 - Low Level Threading - Table of Contents
2009-04-06 - Multi-threaded Allocators
2009-04-06 - The Work Dispatcher
2010-05-29 - Lock Free in x64
2010-07-18 - Mystery - Do Mutexes need More than Acquire-Release -
2010-07-18 - Mystery - Does the Cell PPU need Memory Control -
2010-07-18 - Mystery - Why no isync for Acquire on Xenon -
2010-07-31 - GCC Scheduling Barrier
2010-09-12 - The defficiency of Windows' multi-processor scheduler
2010-09-21 - Waiting on Thread Events Part 2
2010-09-21 - Waiting on Thread Events
2011-03-11 - Worklets , IO , and Coroutines
2011-05-13 - Avoiding Thread Switches
2011-07-06 - Who ordered Condition Variables -
2011-07-08 - Event Count and Condition Variable
2011-07-08 - Who ordered Event Count -
2011-07-09 - LockFree - Thomasson's simple MPMC
2011-07-09 - TLS for Win32
2011-07-10 - Mystery - Do you ever need Total Order (seq_cst) -
2011-07-13 - Good threading design for games
2011-07-14 - ARM Atomics
2011-07-14 - compare_exchange_strong vs compare_exchange_weak
2011-07-14 - Some obscure threading APIs
2011-07-15 - Review of many Mutex implementations
2011-07-16 - Ticket FIFO Mutex
2011-07-17 - Atman's Multi-way Ticket Lock
2011-07-17 - CLH list-based lock
2011-07-17 - Per-thread event mutexes
2011-07-18 - cblib Relacy
2011-07-18 - MCS list-based lock
2011-07-20 - A cond_var that's actually atomic
2011-07-20 - Some condition var implementations
2011-07-20 - Some notes on condition vars
2011-07-24 - A cond_var that's actually atomic - part 2
2011-07-25 - Semaphore from CondVar
2011-07-26 - Implementing Event WFMO
2011-07-29 - A look at some bounded queues
2011-07-29 - Semaphore Work Counting issues
2011-07-29 - Spinning
2011-07-30 - A look at some bounded queues - part 2
2011-07-31 - An example that needs seq_cst -
2011-08-01 - Double checked wait
2011-08-01 - A game threading model
2011-08-01 - Non-mutex priority inversion
2011-08-09 - Threading Links
2011-11-28 - Some lock-free rambling
2011-11-30 - Basic sketch of Worker Thread system with dependencies
2011-11-30 - Some more Waitset notes
2011-12-03 - Worker Thread system with reverse dependencies
2011-12-05 - Surprising Producer-Consumer Failures
2011-12-08 - Some Semaphores
2012-03-06 - The Worker Wake and Semaphore Delay Issue
2012-05-30 - On C++ Atomic Fences
2012-05-31 - On C++ Atomic Fences Part 2
2012-06-01 - On C++ Atomic Fences Part 3

6/01/2012

06-01-12 - On C++ Atomic Fences Part 3

Finally a small note of confusion. There are cases where I think "what I need is a #StoreLoad", but a seq_cst fence doesn't work, and changing the load to an RMW does work. Let's try to look into one of those cases a bit.

I will use an example that I used before, a very simple "futex" (not really) waitset-based exchange-mutex. I'm gonna use the exact same code here as I did there, but I'm putting the futex_system ops into the mutex to make it all a bit leaner :


struct futex_mutex3
{
    std::atomic<int> m_state; // mutex locked flag

    // waitset :
    HANDLE          m_handle; // actual wait event
    atomic<int>     m_count;  // waiter count

    /*************/
                
    futex_mutex3() : m_state(0), m_count(0)
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    }
    ~futex_mutex3()
    {
        CloseHandle(m_handle);  
    }

    void lock()
    {
        // try to grab the mutex by exchanging in a 1
        //  if it returns 0, we got the lock
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            // prepare_wait :
            // add one to waiter count
            m_count($).fetch_add(1,mo_acquire);
        
            // (*1)
        
            // double check :
            while ( m_state($).exchange(1,rl::mo_acquire) )
            {
                // wait :
                WaitForSingleObject(m_handle, INFINITE);
            }
            
            // retire_wait :
            m_count($).fetch_add(-1,mo_relaxed);
        }
    }

    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // need #StoreLoad before loading m_count

        // (*2)

        int c = m_count($).load(mo_acquire);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }
};

The code as written does not work. The problem is the interaction of spots *1 and *2.

(mutual exclusion is guaranteed by this code by our actions on m_state , which also provides the necessary acquire & release to make the mutex valid. So when I say it "doesn't work" it means the waitset interaction with the mutex doesn't work, eg. we deadlock by failing to wake a waiter, it's a "missed wakeup" problem).

(irrelevant aside : if you want this to be a real mutex implementation, then the lock() operations on m_state should probably be acq_rel to prevent mutex overlap deadlocks; see this blog post among others; but the purpose of this post is not to make a real mutex, it's to demonstrate an issue, so let's not get bogged down)

In brief, the problem is that the unlocker at *2 can load a waiter count of 0 (and thus not signal), even though the waiter has passed point *1 (and thus count should not be zero).

The bad execution goes like this :


1. thread 0 holds the lock on the mutex

thread 1 calls lock()

2. thread 1 tries to lock the mutex, sees m_state=1 and goes into prepare_wait
3. thread 1 does m_count ++
4. thread 1 tries the exchange again, sees m_state=1 and goes into the wait

thread 0 calls unlock()

5. thread 0 stores a 0 to m_state
6. thread 0 loads m_count and gets 0 (out of date value)

Now, you might think the problem is that the load can act like it hoists above the store. That is, we know the store happens after the exchange (#4), because the exchange didn't see a zero. Therefore #3 (the inc to count) must already have happened. But the load at #6 is seeing the value before #3; sure, that's allowed, the load has no ordering constraint that stops it from moving back in time.

So clearly we want a #StoreLoad between 5 and 6 to prevent that load from backing up. You cannot express that in C++0x and that's what I meant in this original blog post when I said that the C++0x seq_cst fence is not really a StoreLoad and there's no way to really express this kind of StoreLoad in C++0x. Specifically, just adding a seq_cst fence here where you want a StoreLoad does not work :


    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // need #StoreLoad before loading m_count

        std::atomic_thread_fence(mo_seq_cst);

        int c = m_count($).load(mo_acquire);        
    
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }

The problem as I understand it is that a single fence in C++0x doesn't really do what you want. I kind of got at this in Part 2 as well, like you can't just use a fence as a way to publish and have relaxed loads receive it. You need another fence in the receiving thread, so that the fences can "synchronize with" each other. Also if you go back to Part 1 and look at most of the rules about fences, they only provide ordering if they have the right kind of object to connect through; you need something to carry a transitive relationship.

Note : I believe that on every real CPU, putting a MemoryBarrier() there where you want the #StoreLoad would make this code work. This example is actually very similar to the Peterson lock we saw in part 2.

Boiled down, the problem is like this :


A & B initially zero

thread 0 :
    A = 1
    #StoreLoad
    load B

thread 1 :
    B = 1
    #StoreLoad
    load A

It should not be possible for both threads to load 0. Either one or both threads should see a 1. Now if you make both #StoreLoads into atomic_thread_fence(seq_cst) then it works - but not because the fence is a #StoreLoad. It works because the two seq_cst fences must have a definite order against each other in the total order S, and that provides a reference for all the other ops to be "happens before/after" each other.

To be very clear, this code works :


    thread 0:

        A($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$);
        r1($) = B($).load(mo_relaxed);

    thread 1:

        B($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$);
        r2($) = A($).load(mo_relaxed);

    after :

        r1+r2 == 1 or 2 (never 0)

(BTW this is actually the classic WFMO case in semi-disguise; you're waiting on the two conditions A and B to become true; if two separate threads set the conditions, then at least one should see the joint A && B be true, but that only works with the appropriate #StoreLoad; see this blog post )
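
The same example translates directly to standard C++11 (no Relacy harness) and can actually be run; the guarantee that r0+r1 is never 0 is exactly what the two seq_cst fences buy you. (The name run_once is mine, this is just a runnable restatement of the code above.)

```cpp
#include <atomic>
#include <thread>

static std::atomic<int> A, B;

// One trial of the two-thread example : both threads store 1, fence,
// then load the other variable.  With both seq_cst fences in place the
// memory model guarantees at least one thread sees the other's store,
// so this can never return 0.
int run_once()
{
    A.store(0, std::memory_order_relaxed);
    B.store(0, std::memory_order_relaxed);
    int r0 = -1, r1 = -1;

    std::thread t0([&]{
        A.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // the "#StoreLoad"
        r0 = B.load(std::memory_order_relaxed);
    });
    std::thread t1([&]{
        B.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // the "#StoreLoad"
        r1 = A.load(std::memory_order_relaxed);
    });
    t0.join();
    t1.join();

    return r0 + r1; // 1 or 2, never 0
}
```

Loop it a few thousand times on a multicore box; remove either fence (or weaken it) and you should eventually observe a 0.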

But what if we remove the need for one of the threads to have a seq_cst fence :


    thread 0:

        A($).store(1,mo_relaxed);
        std::atomic_thread_fence(mo_seq_cst,$); // #StoreLoad ?
        r1($) = B($).load(mo_relaxed);

    thread 1:

        B($).store(1,mo_relaxed);
        // load A will be after store B via StoreStore release ordering
        r2($) = A($).fetch_add(0,mo_acq_rel);

    after :

        r1+r2 == 1 or 2 (never 0)

Here thread 1 uses an RMW to ensure StoreLoad ordering, so we get rid of the fence. This code no longer works. Now the B.load in thread 0 can hoist above the B.store in thread 1. The reason is that the seq_cst fence in thread 0 is not acting as a StoreLoad any more because it has nothing to synchronize against in the other thread.

Take away : atomic_thread_fence(mo_seq_cst) does NOT really act as a #StoreLoad. It can be used in spots where you need a #StoreLoad, but only in the right situation, eg. if it has the right other stuff to synchronize with.

So, getting back to our futex_mutex example, you can use a seq_cst fence at (*2) to act as your storeload, but only if you also have one at (*1) for it to synchronize with :


struct futex_mutex3
{
    std::atomic<int> m_state; // mutex locked flag

    // waitset :
    HANDLE          m_handle; // actual wait event
    atomic<int>     m_count;  // waiter count

    /*************/
                
    futex_mutex3() : m_state(0), m_count(0)
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    }
    ~futex_mutex3()
    {
        CloseHandle(m_handle);  
    }

    void lock()
    {
        // try to grab the mutex by exchanging in a 1
        //  if it returns 0, we got the lock
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            // prepare_wait :
            // add one to waiter count
            m_count($).fetch_add(1,mo_relaxed);
        
            // (*1)
            std::atomic_thread_fence(mo_seq_cst,$);
        
            // double check :
            while ( m_state($).exchange(1,rl::mo_acquire) )
            {
                // wait :
                WaitForSingleObject(m_handle, INFINITE);
            }
            
            // retire_wait :
            m_count($).fetch_add(-1,mo_relaxed);
        }
    }

    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // (*2)
        // need #StoreLoad before loading m_count

        std::atomic_thread_fence(mo_seq_cst,$);

        int c = m_count($).load(mo_acquire);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }
};

That is, the two fences allow you to set up a transitive relationship - like the m_count.load in unlock() is definitely after the fence in unlock, the fence in unlock is after the fence in lock, and the fence in lock is after the m_count.fetch_add ; therefore the m_count.load must see count > 0.

Or, alternatively, if you leave the fence out of lock(), you can fix it just in unlock by changing either the store or the load into an RMW :


variant "apple" :

    void unlock()
    {
        // release is for the mutex innards
        m_state($).exchange(0,rl::mo_acq_rel);

        // #StoreLoad is achieved as a #LoadLoad on m_state

        int c = m_count($).load(mo_relaxed);        
            
        if ( c > 0 )
        {       
            SetEvent(m_handle);
        }
    }

variant "banana" :

    void unlock()
    {
        // release is for the mutex innards
        m_state($).store(0,rl::mo_release);

        // #StoreLoad is achieved as a #StoreStore on m_count :

        int c = m_count($).fetch_add(0,mo_release);  //(*3)
            
        if ( c > 0 )
        {
            SetEvent(m_handle);
        }
    }

(variant "apple" only works if the double check on m_state in lock is acq_rel).

(variant "banana" appears to work if *3 is only mo_relaxed, which is a bit of a mystery! We'll leave that as an exercise for the reader). (update : it doesn't actually work, see comments).
