7/28/2012

07-28-12 - DeUnicode 1.1

Made a few revisions to DeUnicode. Mainly because I got sick of continuing to run into stupid file name problems, you now have the option to even further reduce the set of allowable characters.

To make file names that are (almost) guaranteed to not cause any problems, you can use

DeUnicode -a -c -s *
where -a makes the output 7-bit ascii (instead of 8-bit windows "A" code page (not ANSI), which is the default), -c removes all special characters that are used in my console ("*^;[]()" etc., as well as anything under 32), and -s removes spaces.

Once you do all that, hey you actually get a fucking file name that you can use in the real world. While *some* apps work with some of those characters and *some* command shells work with some of those characters, if you have any of them, it's quite possible you will run somebody else's wild card traverse and it will do unexpected things, which is very very very (very very very) bad. Particularly when deleting or renaming.

usage :

c:\src\deunicode>deunicode -?
DeUnicode v1.1 by cbloom
usage:
 DeUnicode [-opts] <dir>
-r : recurse
-t : special the ("the blah" -> "blah, the")
-c : remove chsh special characters
-s : remove spaces
-a : ascii output [else 'A' code page]
-d : display only (don't do)
-q : quiet
-v : verbose
DeUnicode.zip at cbloom.com

7/23/2012

07-23-12 - Structs are not what you want

I was looking at some stupid struct in my code which was something like
struct Stuff
{
    U16     m_val1;
    U32     m_val2;
    U16     m_val3;
};
because of the order of variables and the standard packing, it wound up taking 12 bytes instead of 8. In this particular case, the types were actually templates or #defined types so I don't know statically that the ordering is fucked up and causing that problem.

It occured to me that the whole C "struct" thing is really not what we want most of the time. 99% of the time I want to say "put all these values in this thing, and I don't care what order". In particular, Mr. Compiler, you can optimize the order to minimize size, or for speed, or whatever.

Now that's not a big deal with just a simple struct, but more generally it's massive.

What if I could say "bag" is some values with certain types and names. Then when I combine bags, you don't duplicate things that are in both. And you can reinterpret bags to other bags as long as they have the needed values.


bag Particle
{
    Vec3    m_pos;
    ColorDW m_color;
};

bag GameObject
{
    Vec3    m_pos;
    String  m_name;
};

bag GameObject_AndColor : GameObject
{
    ColorDW m_color;
};

now a GameObject_AndColor can be passed to any function that wants a Particle because it provides all the needed fields.
alternatively :

bag GameObjectParticle : GameObject, Particle
{
};

only has one m_pos.

Obviously this doesn't work with C functions which are compiled once and get to know the offset to the data in that one compile. You want a compiler that can compile all the logic but leave the variable references unspecified until it is bound to a concrete type.

Now obviously templates sort of address this, but in a much uglier way.

Templates are worse in 2 ways. 1. they don't do any shared compilation, they leave all compilation to the last minute when they know the concrete type they get to work on; conversely bags can do 99% of the compilation up front and just need to specialize the variable addressing per usage. 2. they don't do any static type checking; that is, when you write the template you can't specify in any kind of clean way "you can use this template on anything that provides m_pos and m_color" (C++ concepts were supposed to sort of address this) ; bags provide this very nicely and let you write a template and error-check it before you even apply it to anything.

Obviously templates are more powerful, but not useable in reality because of these problems; they delay compilation too much for widespread use, it makes compilation too slow, and puts the errors in the wrong place (at the use site, not the template itself). Compilation up front is great (I think weakly typed languages are ridiculous disasters).

But in theory bags can do even more. One annoying problem we have at RAD is that factored out functions can be massively slower than macros. There are various reasons for this, but one is that if you have some variables in registers and want to call a function on them, the compiler will almost always do a ton of work to move those variables around that it doesn't need to do (either writing them back to memory or pushing them on the stack, or just moving them to other registers).

With bags in theory you could do something like :


Vec3 pos1;
int i;
ColorDW c;

... some code that sets up pos1 and c ...

bag Particle p = { pos &= pos1, color &= c } ;  

// this is not a copy
// it says that the variables I already have can be treated as a Particle bag

Particle_Move(p);

// Particle_Move generates code that acts directly on my local variables "pos1" and "c" , no copying

The win is that Particle_Move basically acts like a macro, but I get the type-safety and separate compilation error check benefits of an inline function.

Similarly, I shouldn't have to write new code every time I have a bunch of data that I want to use as SOA (structure of arrays) instead of AOS (array of structures). eg. if I have


Vec3 positions[100];
ColorDW colors[100];

I should be able to use Particle_ functions on those, because they provide the values needed to make a valid bag.

Being a bit redundant : the classic way that simple C inheritance fails happens a lot with vertex types in graphics. It goes something like this :


struct Vertex_Pos { Vec3 pos; }

// functs that only need a "pos" member act on a Vertex_Pos

struct Vertex_PosNormal : Vertex_Pos { Vec3 normal; }

// functs that need a pos and normal act on Vertex_PosNormal

struct Vertex_PosColor : Vertex_Pos { Color color; }

// functs that need a pos and color act on Vertex_PosColor

struct Vertex_PosNormalColor : ??

// oh crap, which do I do ?

struct Vertex_PosNormalColor : Vertex_PosNormal { Color color; }
struct Vertex_PosNormalColor : Vertex_PosColor { Vec3 normal; }

// either way is busted
// with bags I should be able to just do :

bag Vertex_PosNormalColor { Vec3 pos; Vec3 normal; Color color; }

// (or either of the above inheritance ways)

and then the bag Vertex_PosNormalColor can be used as a Vertex_PosNormal or Vertex_PosColor without even explicitly specifying any inheritance. (an ideal compiler would also let you specify that as a compile-time constraint on the type).

Now obviously inheritance and virtual functions solve a slightly different problem than bags. They let you act on part of a type without knowing the concrete type. Bags have to be used on the concrete type, just like templates. I guess the fundamental problem with structs (and why they're wrong for some uses) is that they are actually solving this slightly different problem.

Anyhoo.

7/19/2012

07-19-12 - Experimental Futures in Oodle

I don't know if this will ever see the light of day, but it's fucking sexy as hell so here's a sneak peek.

"futures" implemented in C++98 with Oodle :


void example_future_comp_decomp()
{
    future<OodleBufferRC> rawBuf = oodle_readfile("r:\\oodle_example_future_input");
    
    // call :
    // oodle_compress_sync( rawBuf, OodleLZ_Compressor_LZH, OodleLZ_CompressSelect_Fast );
    // but not until rawBuf is done :
    future<OodleBufferRC> compBuf = start_future<OodleBufferRC>( oodle_compress_sync, rawBuf, OodleLZ_Compressor_LZH, OodleLZ_CompressSelect_Fast );
    
    future<const char *> write = start_future<const char*>(oodle_writefile,"r:\\oodle_example_future_comp",compBuf);
        
    future<OodleBufferRC> read_compBuf = start_future<OodleBufferRC>( oodle_readfile, write ); 
    
    future<OodleBufferRC> read_decompBuf = start_future<OodleBufferRC>( oodle_decompress_sync, read_compBuf );
    
    start_future<const char *>( oodle_writefile, "r:\\oodle_example_future_decomp",read_decompBuf);
}

This creates an async chain to read a file, compress it, write it, then read it back in, decompress it, and write out the decompressed bits.

Futures can take either immediates as arguments or other futures. If they take futures as arguments, they enqueue themself to run when their argument is ready (using the forward dependency system). Dependencies are all automatic based on function arguments; it occurs to me that this rather like the way CPU's do scheduling for out-of-order-processing.

(in contrast to idea #1 in Two Alternative Oodles , here we do not get the full async graph in advance, it's given to us as we get commands, that is, we're expected to start running things immediately when we get the command, and we don't get to know what comes next; but, just like in CPU's, the command submission normally runs slightly ahead of execution (unless our pipeline is totally empty), in which case we have a little bit of time gap)

Functions called by the future can either return values or return futures. (eg, in the code above, "oodle_readfile" could just return an OodleBufferRC directly, or it could return a future to one). If a function in a future returns a future, then the returned future replaces the original, and doesn't return to the outer scope until the chain of futures returns a non-future value. That is, this is a way of doing coroutine yields basically; when you want to yield, you instead return a future to the remaining work. (this is like the lambda-style coroutine yield that we touched on earlier). (* - see example at end)

future of course has a wait() method that blocks and returns its value. As long as you are passing futures to other futures, you never have to wait.

You can implement your own wait_all thusly :


int nop(...)
{
    return 0;
}

template <typename t_arg1,typename t_arg2,typename t_arg3,typename t_arg4>
future<int> done_when_all( t_arg1 a, t_arg2 b, t_arg3 c, t_arg4 d )
{
    return start_future<int>( nop, a,b,c,d );
}

then call

done_when_all( various futures )->wait();

A few annoying niggles due to use of old C++ :

1. I don't have lambdas so you actually have to define a function body every time you want to run something as a future.

2. I can't induce the return type of a function, so you have to explicitly specify it when you call start_future.

3. I don't have variadic templates so I have to specifically make versions of start_future<> for 0 args, 1 arg, etc. bleh. (though variadic templates are so ugly that I might choose to do it this way anyway).

Otherwise not bad. (well, that is, the client usage is not bad; like most C++ the implementation is scary as shit; also doing this type of stuff in C++ is very heavy on the mallocs (because you have to convert things into different types, and the way you do that is by new'ing something of the desired type), if you are a sane and reasonable person that should not bother you, but I know a lot of people are in fact still bothered by mallocs).

In order to automatically depend on a previous future, you need to take its return value as one of your input arguments. There's also a method to manually add dependencies on things that aren't input args. Another option is to carry over a dependency through a binding function which depends on one type and returns another, but that kind of C++ is not to my liking. (**)

To really use this kind of system nicely, you should make functions whose return value is a compound type (eg. a struct) that contains all its effects. So, for example oodle_writefile returns the name of the file written, because it modifies that object; if you had a function that modified a game object, like say an Actor *, then its return value should include that Actor *, so that you can use that to set up dependency chains. (in real code, oodle_writefile should really return a struct containing the file name and also an error code).

* : example of returning futures to continue the async job :


float test_func_5_1(int x)
{
    Sleep(1);
    return x * (2.0/3.0);
}

future<float> test_func_5_2(int x)
{
    Sleep(1);

    if ( x == 1 )
    {
        // hey in this case I can return my value immediately
        return make_future(0.6f);
    }
    else
    {
        // I need to run another async job to compute my value
        x *= 3;
        return start_future<float>(test_func_5_1,x);
    }
}


then use as :


future<float> f = start_future<float>(test_func_5_2,7);

... do other work ...

float x = f.wait();

does what it does.

This is a necessary building block, it lets you compose operations, but it's an ugly way to write coroutine-style code.

What it is good for is creating more complex functions from simpler functions, like :


future<const char *> oodle_compress_then_writefile(const char *filename, OodleBufferRC rawBuf, OodleLZ_Compressor compressor, OodleLZ_CompressSelect select )
{
    OodleBufferRC compBuf = oodle_compress_sync( rawBuf, compressor, select );

    return start_future<const char*>( oodle_writefile, filename, compBuf );
}

I believe this "future" is much better than the real C++0x std::future, which seems to be missing a lot of features.

** : example of using a binding function to carry over dependencies :


// say I want to run two asyncs :

future<int> f1 = start_future<int>( func1 );

future<float> f2 = start_future<float>( func2 , 7.5f );

// but I want to run func2 after func1
//  due to some dependency that isn't through the return value

// what I can use is a return-to-arg adapter like :

template<typename t1,typename t2>
t1 return1(t1 a,t2 b)
{
    b;
    return a;
}

template<typename t1,typename t2>
future<t1> return1_after2(t1 a,future<t2> b)
{
    return start_future<t1>( return1<t1,t2>, a, b );
}


// then run :


future<int> f1 = start_future<int>( func1 );

future<float> f2 = start_future<float>( func2 , return1_after2(7.5f,f1) );

but like I said previously I hate that kind of crap for the most part. Much better is to use the explicit dependency mechanism, like :

future<int> f1 = start_future<int>( func1 );

future<float> f2 = make_future<float>( func2 , 7.5f );
f2->add_dep(f1);
f2->start();

There is one case where the funny binding mechanism can be used elegantly; that's when you can associate the binding with the actual reason for the dependency. That is, if we require func2 to run after func1, there must be some shared variable that is causing that ordering requirement. Using a binder to associate func1 with that shared variable is a clean way of saying "you can read this var after func1 is done".

7/15/2012

07-15-12 - VideoPlayer and Commonopoly

I wrote this little videoplayer for RAD a while ago. The main purpose of it is not for consumer-style use but for developer diagnosing of video compressors.

In particular, you can give it a list of several videos on the command line, and it plays all of them. It plays them frame-exact by stepping through their frames one by one. Then you can pause and single step forward and back, and you can do split-screen or frame-toggles.

(I've always found that comparing image or video compressors is very misleading if you just look at the output; you have to look at them side-by-side with the originals, or even better do full-frame ping-ponging. For example, x264 generally looks great, but when you do full-frame ping-pongs you can see that they really fuck up the overall luma level and completely change the colors quite a bit (this is largely due to the very shitty YUV space that is standard for video, not anything in x264)).

Anyhoo, videoplayer does some nice things like prefetch and cache lots of frames (it's odd that the mainstream video players don't do this; I have gigs of ram you fuckers, prefetch some god damn video so that I can play over a network without hitching; also keep some previous frames around so I can pause and step backwards a little without seeking!).

download : videoplayer.zip (463k at cbloom.com)

(videoplayer needs radutil.dll which is in RAD Video Tools ; I put it on delay-load so you only need if you want to load a real video (not just images))

Anyhoo, I haven't touched it in a while cuz I'm off video for now. But "videoplayer" can also be pointed at a dir and it treats the images in the dir as frames of a video. I did this originally to load frame-dumps of videos that I couldn't play. For example I would use MPlayer to spit out frames with :

mplayer -benchmark -nosound -vo png:z=6 %1
md %1_frames
call mov *.png %1_frames\
and then point my videoplayer at the dir, and since it can prefetch and all that it can play a video of pngs (which load too slow to play without preloading a bunch).

But I realized the other day that I could use dir-loading to just show image slideshows too, if I use the option to manually set the frame rate to something really low like 0.1 fps. Which brings us around to the title of this post :

I discovered COMMONOPOLY a while ago; I think it's super beautiful ; the guy chooses images well and there's something about that horizontal reflection that really does some magic on the brain. If you full-screen it and stare at the middle and let your eyes defocus a bit, you can get the same image going in each eye and it's like mmm yeah good.

But obviously viewing it on the web sucks balls. So what you do is download all the images with DownloadThemAll, put them in a dir and then point videoplayer at the dir thusly :

videoplayerR.exe -f0.125 -w1 -q -0 -2 -4 -s1 -r t:\commonopoly
and then enjoy.

07-15-12 - libpng DLL Delay Load

cblib for a long time has used libpng , which creates a damn DLL dependency. Annoyingly, that happens at app startup, which means even if you are trying to load a .BMP it will fail to start the app if it can't find the PNG DLL (or, you know, using cblib on totally non-image related stuff). What I've wanted for a long time is to put the PNG DLL load into the PNG code so that it's only done if you actually try to load a PNG.

MS could have very easily put DLL load-on-first use into the thunk libs they make. That would have been nice. (ADDENDUM : I guess they did!)

Anyhoo, you can do it yourself, and it looks like this :

We need CALL_IMPORT from before :


template <typename t_func_type>
t_func_type GetImport( t_func_type * pFunc , const char * funcName, HMODULE hm )
{
    if ( *pFunc == 0 )
    {
        ASSERT( hm != 0 );
        t_func_type fp = (t_func_type) GetProcAddress( hm, funcName );
        // not optional :
        ASSERT_RELEASE_THROW( fp != 0 );
        *pFunc = fp;
    }
    return (*pFunc); 
}

#define CALL_IMPORT(name,hm) (*GetImport(&STRING_JOIN(fp_,name),STRINGIZE(name),hm))

and I'm going to just explicitly load the PNG DLL when I get to LoadPNG :

static HMODULE hm_png = 0;
#define HMODULE_FAILED  (HMODULE) 1

static bool my_png_init()
{
    if ( hm_png ) 
    {
        return ( hm_png == HMODULE_FAILED ) ? false : true;
    }
    
    HMODULE hm_zl = LoadLibrary(ZLIB_NAME);
    if ( hm_zl == 0 )
    {
        lprintf("Couldn't load Zlib (%s)\n",ZLIB_NAME);
        hm_png = HMODULE_FAILED;
        return false;
    }
    
    hm_png = LoadLibrary(PNG_NAME);
    if ( hm_png == 0 )
    {
        lprintf("Couldn't load PNG lib (%s)\n",PNG_NAME);
        hm_png = HMODULE_FAILED;
        return false;
    }
    
    lprintf_v2("Using libpng.\n");
                
    return true;
}

#define CALL_PNG(name)      CALL_IMPORT(name,hm_png)

so now we just have to replace all our png calls with CALL_PNG() calls, like :

    png_ptr = png_create_read_struct(PNG_LIBPNG_VER_STRING, NULL,NULL,NULL);

    ->

    png_ptr = CALL_PNG(png_create_read_struct)(PNG_LIBPNG_VER_STRING, NULL,NULL,NULL);

(obviously you could #define png_create_read_struct CALL_PNG(png_create_read_struct) to make it totally transparent)

Lastly we need to define all the fp_ func pointers with the right signature. This is particularly easy because png.h wraps its function externs with macros (nice, thanks guys), so we can just abuse those macros. png.h contains things like :


PNG_EXPORT(png_voidp,png_get_io_ptr) PNGARG((png_structp png_ptr));

So we can just do :

#define PNG_EXPORT(ret,func)    static ret (PNGAPI * STRING_JOIN(fp_,func))
#define PNGARG(args)            args = 0

and then the png proto is defining our fp for us. (unfortunately png.h sticks "extern" in front of all the protos, so you have to copy out the protos and take off the extern).

So anyhoo there you go, libpng use with delay-load.

BTW this also suggests a way that you can make your DLL very easy to use with delay-load and manual imports. What you should do is provide a header with only your function protos in it, and wrap them with a macro like :


DECLFUNC( ret, name , args );

eg.

DECLFUNC( png_voidp, png_get_io_ptr, (png_structp png_ptr) );

Then a client could just include that header multiple times and change DECLFUNC to various things. For example if you had a header like that, you can look up all the func names at LoadLibrary time, instead of doing each one on first use (this removes a branch from each function call site). eg :

#define DECLFUNC( ret, name , args )    ret (CALLBACK * STRING_JOIN(fp_,name)) args = 0
#include "allfuncs.inc"
#undef DECLFUNC

void ImportAllFuncs()
{
HMODULE hm = LoadLibrary;
#define DECLFUNC( ret, name , args )    STRING_JOIN(fp_,name) = ImportFunc(name,hm)
#include "allfuncs.inc"
#undef DECLFUNC
}

Unfortunately you can't do a #define inside a #define or this could be used to alias the names transparently with something like


#define DECLFUNC(ret,name,args) #define name (* STRING_JOIN(fp_,name) )
#include "allfuncs.inc"
#undef DECLFUNC

(ADDENDUM : or, you know, just use the /DELAYLOAD option)

07-15-12 - Internet Toggle

For the most part I'm not that prone to the time-wasting allure of the internet when I'm supposed to be working. But recently I've been writing the docs for Oodle and it's just excruciating boring work. Once in a while I have something important to say in the docs, but the vast majority is stuff like :

DOCGEN int Oodle_ComputeAPlusB(int A, int B);
/* Computes A + B

    $:A   the value of A in A+B
    $:B   the value of B in A+B
    $:return    the sum of and A and B

  Oh god please kill me now.

*/

Every time I wrote one of those docs I would involuntarily pop up my web browser and see if Kids on Crack had any updates. So I started turning off my internet access to break that habit.

I was using ipconfig /release and /renew to do it, but that has a few problems : 1. it's very slow (not a big deal since the toggle is rare), and 2. it also kills my VPN to work.

Jeff suggested a better way is to turn DNS off and on. I found netsh can do that pretty easily. I'd never used netsh before, it's pretty nice. The commands are :


dns_on.bat :

netsh interface ip set dns name="Wireless Network Connection" dhcp

dns_off.bat :

netsh interface ip set dns name="Wireless Network Connection" static 0.0.0.0
ipconfig /flushdns

(or, you know, slightly different as is appropriate for your net config).
(note the flushdns also, otherwise you'll have things like Google in cache).

For machines you want to still access in "internet off" mode, you can of course just use their explicit IP, or slightly nicer you can add them to your "hosts" file.

(aside : another cool feature of netsh is that you can save & restore your entire network config if you want to temporarily mess things up; just use "netsh dump" to save it and "netsh exec" to restore it).

07-15-12 - cbvector

I tried to use std::vector in one of the Oodle examples, and the freaking MSVC STL generates link dependencies by default. WTF. So I got annoyed and wrote my own single file standalone vector.

Caveats : this is ripped out of cblib (with dependencies removed), so it's rather ugly. This is not totally STL compatible. It works well enough for my purposes. Feel free to improve it.

Download : cbvector.h at cbloom.com

cbvector.h intentionally doesn't include anything. You may need to include stddef.h and/or new.h before cbvector.h.

WARNING : cbvector.h uses #defines to configure various aspects of the implementation. If you set those #defines to different things in different code files, then you must ensure that one of the following is true : 1. the CB_VECTOR class name is different in each case, or 2. the linkage of all funcs affected is "static", or 3. the entire cbvector is in an anonymous namespace (which is the default). If you're not careful about this you can get super bizarro bugs when the linker merges functions that aren't actually the same.

6/21/2012

06-21-12 - Two Alternative Oodles

I want to write about two interesting ideas for async IO systems that I am *not* doing in Oodle.

1. The graph-forwarding automated parallelism system.

The idea here is that you make all your async operations be like flow chart widgets, they have "in" and "out" channels, and then you can draw links and hook up ins to outs.

This creates dependencies automatically, each op depends on everything that feeds its input channels.

So for example you might do :


"c:\junk"  :-> OpenFile :-> fh

(that is, OpenFile takes some string as input and then puts out a file handle named "fh")

fh :->   ReadFile  :->   buf[0-32768]
buf :->
0-32768  :->

(that is, make a ReadFile op that takes the output of the Openfile, and outputs a valid buffer range)

buf[0-32768]  :->  SPU_LZCompress :-> lzSize , compbuf[0-lzSize]
compbuf

(do an LZ compress on the valid buf range and output to compressed buf)

lzSize , compbuf[0-lzSize] :-> WriteFile

etc..

This sets up a chain of operations with dependencies, you can fire it off and then wait on it all to complete. So, in a sense it's nice because you don't have to write code that waits on the completion of each op and fires the next op and so on.

There are various ways you could set up the async chains, obviously GUI tools where you drag lines around are popular for this sort of thing, but I think that's a terrible way to go. You could just have a text markup, or some creation functions that you call to build up the graph.

Another interesting option is to use the "just run the code" method. That is, you make proxy classes for all the variable types and do-nothing functions with the names of the ops (ReadFile, etc.); then you just run this fake imperative code, and all it does is record the calls and the arguments, and uses that to build the graph. That's easy enough to do for code without branches, but that's sort of a trivial case and I'm not sure how to make it work with branches. In fact in general this type of thing sucks bad for code with loops or branches.

Anyway, I think that this method is basically a terrible idea, except for one thing : creating the graph of the entire async operation before doing it can be a huge performance win. It allows you to "see the future" in terms of what the client wants to do, and thus make better scheduling decisions to maximimize utilization of your available computation resources and disk at all times.

In the simplest case, if the client calls a huge Read and the a huge LZcompress after that, that's a dumb non-parallel way to do things, but in the normal imperative Oodle I can't do anything about it, because at the time I get the Read I don't know what's coming after it. If you gave me the graph, I could go, oh hey during the Read I'm not using the CPU at all, and during the big LZCompress I'm not using the disk, so let me break those into smaller pieces and overlap them. Obviously you can schedule IO's around the timeline so that they try to be issued early enough so their results are back when needed. (though even with the full async graph you can't really schedule right unless you know how long the cpu operations are going to take).

There are even more subtle low level ways that not knowing the future gets me. In the whole worker thread system, there are crucial decisions like "should I wake up a new worker thread to take this work item, or wait for one of the existing worker threads to take it?" or "should I signal the worker threads as I make each item, or wait until I make them all?" ; or even just deciding which worker thread should take a task to maximize cache coherence. You cannot possibly get these decisions right without knowing the future.

Anyhoo, I don't think the advantage outweighs the horribleness of writing code this way, so on to the next one :

2. The coroutine auto-yielding system.

What if your file IO was always done from a coroutine, and instead of blocking when it didn't have the bytes needed, it just yielded the coroutine? That would give you fully async file IO (in the sense that you never block a thread just waiting on IO), and you could write code just like plain sync IO.

eg. you would just write :


// in a coroutine :

FILE * fp = fopen("c:\junk","rb");  // yields the coroutine for the async open

int c = fgetc(fp); // might yield for a buffer fill

etc..

This is sort of appealing, it certainly makes it easy to write async IO code. I personally don't really love the fact that the thread yielding is totally hidden, I like major functional operation to be clear in the imperative code.

The real practical problem is that you just can't make this nice in the hacky switch-based C coroutine method that I use. You really need a language that supports coroutines natively.

You can cook up a clunky way of doing this with the C coroutines, something like :


#define cofread(fp,buf,size) \
    for(;;)
    {
        {
        OodleAsyncHandle h = Oodle_ReadOrReturnAsyncHandle(fp,buf,size);
        // Oodle_ReadOrReturnAsyncHandle returns 0 if the Read could complete without waiting
        if ( ! h )
            break;

        Coroutine_AddDependency(h);
        }
        COROUTINE_YIELD();
    
        Coroutine_FlushDoneDependencies();
    }


where COROUTINE_YIELD is my macro that does something like :
  self->state = 7;
  return;
  case 7:

So now you can call cofread() from an Oodle coroutine and it kind of does what we want.

But, because of the fact that we use switches and returns, we can't use any stack variables in the arguments to cofread ; eg :

{
    int size = 16384;

    cofread(fp,buf,size);
}
is no good, if you resume at the YIELD point inside cofread, "size" is gone. (you'd get a case statement skips variable initialization error or something like that).

Basically with the hacky C coroutine method you just can't do funny business like this where you hide the control flow; you have to make the yield points very explicit because they are points where you lose all your stack variables and must recreate them.

Perhaps a larger issue is that if you really were going to go with the full coroutine auto-yielding system, you'd want to be able to yield from inside function calls, not just at the root level of the coroutine. eg. you'd like to call functions that might do file IO or fire off worker tasks, and you want them to be able to yield too. That's not possible unless you have full stack-saving coroutines.

ADDENDUM for clarity :

It's totally trivial to fix the lack of stack saving in a limited way. All I have to do is reserve a few slots in the coroutine struct that cofread can use to store its variables. So cofread becomes :


#define cofread(fp,buf,size) \
    co->m_fp = fp; co->m_buf = buf; co->m_size = size;
    for(;;)
    {
        {
        OodleAsyncHandle h = Oodle_ReadOrReturnAsyncHandle(co->m_fp,co->m_buf,co->m_size);
        // Oodle_ReadOrReturnAsyncHandle returns 0 if the Read could complete without waiting
        if ( ! h )
            break;

        Coroutine_AddDependency(h);
        }
        COROUTINE_YIELD();
    
        Coroutine_FlushDoneDependencies();
    }

and now you really can use cofread within a coroutine, and you can use local variables as arguments to it, and it yields if it can't complete immediately, and that's all nice.

But it's *not* what I really want for this proposal, which is a full transparent system that a client can build their IO on. The problem is that cofread can only be called at the "root" level of a coroutine. That is, because the "yield" is not a true language yield that preserves function call stack, it must be in the base coroutine function.

eg. you can do :


MyCoroutine1( coroutine * co )
{

COROUTINE_START()

   g_fp = cofopen("blah");  

    cofread(g_fp,g_buf,1024);

COROUTINE_DONE()

}

That's easy. But you cannnot do :

void MyHelper()
{
   g_fp = cofopen("blah");  

    cofread(g_fp,g_buf,1024);
}


MyCoroutine2( coroutine * co )
{

COROUTINE_START()

    MyHelper();

COROUTINE_DONE()

}

and that lack of composability makes it unusable as a general purpose way to do IO.

To be super clear and redundant again - Oodle of course does support and extensively uses coroutine IO, but it is for small tasks that I want to have maximum performance (like, eg. read and decompress a Package), where the limitation of having to write all your yielding code within one function is okay. The idea of proposal #2 is to make a system that is visible to the client and totally transparent, which they could use to write all their game IO.

(ASIDE : there is a way to do this in C++ in theory (but not in practice). What you do is do all your yielding at the coroutine root level still, either using the switch method or the lambda method (doesn't really matter). To do yields inside function calls, what you do is have your IO routines throw a DataNotReady exception. Then at the coroutine root level you catch that exception and yield. When you resume from the yield, you retry the function call and should make it further this time (but might throw again). To do this, all your functions must be fully rewindable, that is they should be exception safe, and should use classes that back out any uncommitted changes on destruction. I believe this makes the idea technically possible, but unusable in reality).

6/19/2012

06-19-12 - Two Learnings

Two things that I got wrong in the past and have only recently come around to what I believe is the right way.

1. Use pointer-sized ints for memory buffer sizes.

When I made the transition to building 32-64-bit cross platform I was really annoyed with the fact that I couldn't just use "int" everywhere any more. To make it easier for myself, I mostly just used int64 for memory buffers, and made a bunch of helpers for myself that returned int instead of size_t and intptr_t , eg. for things like vector.size() and strlen() and so on I used int-returning variants.

The nice thing about using int64 for memory buffer sizes is that your type doesn't change when you build different targets; that makes coding a lot simpler (and removes the possibility of bugs due to struct sizes changing and such).

Only very recently have I come around, and it's probably not for the reason you think. It's because using a separate type for "pointer sized thing" is a good way of documenting functions with types.

In the previous API post I talked about how I like the function arg types to be as self-documenting as possible. It's even better if you can make it a compile error or warning when you mis-use it. So I was looking at API's like :


Oodle_Read( OodleFile * f, void * buffer, S64 size, U64 filePos );

in particular at the call site where you have something like :

S64 i1,i2;

Oodle_Read(f,ptr,i1,i2);

you can't tell what the last two args are just by looking at the call site, and what's worse, you can switch them in the call order and get no warning. That sucks. It's much better to do :

Oodle_Read( OodleFile * f, void * buffer, SINTa size, U64 filePos );

(SINTa is the RAD pointer-sized signed int). It's now clearly documented in the variable types - this is the size of a memory buffer.

If you have an old code base, it's very annoying at first to do this, because you'll have a lot of conversions to do. It only becomes clean once you have changed your whole code base to follow the rules :


SINTA/UINTa - all memory buffer sizes and array sizes

S64/U64 - file positions and sizes

S32/U32 - really not used for much, just enums and modes and flags, not array sizes

Not only does this put more documentation on the variable, it also makes it more clear when you are doing dangerous cast-downs.

eg. when you try to read a whole file into a memory buffer. You get the file size as an S64 and then you need to pass something to malloc, which takes an SINTa. That makes a clear single point where you are doing a questionable cast. Once you've done that one cast, the rest of the code is cast-free which is nice.

Furthermore, I think it's augmented by having helpers for the cast-downs :


U32 oo64to32(U64 x);  // used super rarely, very weird thing to do
U32 ooAto32(UINTa x); // used super rarely, very weird thing to do

UINTa oo64toA(U64 x); // used for file sizes -> memory buffers, somewhat common

There are only three cast-downs needed, and really the first two should almost never be used; if you are using them a lot it's a sign of bad code. The last one is common, and it should do a check to ensure the cast is okay (eg. if using a > 2GB file size on a 32 bit system).

2. Avoid recursive mutexes.

Like many people, I read the wisdom of the ancients (old school threading programmers) who said that "recursive mutexes are of questionable value and lead to dangerous code; prefer non-recursive mutexes" ; I read that and I went "psshaw" and thought they were crotchety dinosaurs. Well, I've come around.

The thing about recursive mutexes is that like much code which is attractive to the novice, they make the trivial case simpler and look cleaner, but they make the hard case much worse, and the hard case if what actually matters.

The trivial case is : object only locks itself, never locks other objects, never can be freed while locked, etc.

But inevitably in real world threaded code you have to deal with the harder case for mutexes, which is : object might have to lock other objects (and they might have to lock it); lock order may be hard to establish; object may want to free itself while locked, etc.

The way that the novice normally makes a thread-safe object with a recursive mutex is to put a lock scoper in every function on that object :


Object::Func(...)
{
  m_mutex.Lock();

  do stuff;

  m_mutex.Unlock();
}

or really :

Object::Func(...)
{
  MUTEX_SCOPER(m_mutex);

  do stuff;
}

and crucially, when you call other member functions on yourself, that takes the mutex again recursively. (we're going to ignore the efficiency cost of taking the mutex many times). At first that way seems nice, it's easy, but then you encounter one of the real world problems and it falls apart.

What's better is to separate the mutex-taking from the actions, so instead you do :


Object::Func_Locked(...)
{
  do stuff;
}

Object::Func_Unlocked(...)
{
  MUTEX_SCOPER(m_mutex);

  Func_Locked();
}

Then all the functionality is in the _Locked functions and they can call each other.

So now you need to do something like call A->Func1 and B->Func2 , and it has to be done "atomically" , eg. with both locked. Then you run into the issue of mutex order of acquisition and possible deadlocks. If you have used the first style where objects lock themselves, then it is impossible for objects to call each other safely. That is, object A can never call object B, because object B might call object A and there's a deadlock. But with the _Locked functions, any _Locked function can call to another _Locked function, and they don't have to worry about deadlock; instead the lock taking is all pushed up and can be done in a standardized order (or even atomically if you have multi-mutex lock acquisiton).

The other thing that non-recursive locks cleans up, is that you know whenever you see a call to Unlock(), that's the *last* call to unlock and it definitely makes the object publicly visible. That is, there is a crucial transition point from "I own this object" to "others can have it", and with the recursive lock method, that transition point is muddied.

For example, consider this simple and common case : something in the object's internal state tells you it should be deleted. You must do that atomically, because if you unlock then delete, the internal state might change, eg. :


Object::Func()
{
  m_mutex.lock();

  if ( m_x > m_y )
     deleteMe = true;

  m_mutex.unlock();

  // *!*

  if ( deleteMe )
    delete this;
}

That code is no good, at the ! point, some other thread might touch object and make the check m_x > m_y no longer be true, and then we would be deleting this object incorrectly, when the invariant is not set. In order to make this work you need a combined "unlock_and_delete" operation. But to do that you need to know that your unlock is the *last* unlock - you can't do that in the recursive mutex style.

Basically the recursive mutex is sort of like optimizing the code that's already fast; it makes the simple case a bit simpler, but it makes the real world hard cases much worse. Better to just start from the beginning with the more robust long term solution.

Once you realize that the advantage of a recursive mutex is an illusion, then the advantages of the non- recursive mutex become appealing. You can implement a non-recursive mutex far more efficiently. It can take only 1 bit of memory. Every object can easily have its own non-recursive mutex and they don't even need to be allocated or initialized at all.

6/18/2012

06-18-12 - Run Logged

A handy batch file :

runlogged.bat :

set args=%*
if "%args%"=="" end.bat
set Now=%TIME%
REM ECHO It's %Now% now
REM remove decimal part of time :
FOR /F "tokens=1 delims=." %%A IN ("%Now%") DO SET Now=%%A
REM ECHO It's %Now% now
REM alternate way :
REM FOR /F %%A IN ('TIME/T') DO SET Now=%%A
REM ECHO It's %Now% now
set str=%DATE%_%Now%_%args%
call make_clean_str %str%
set str=%str:.=_%
if not exist r:\runlogs call md r:\runlogs
set logname=r:\runlogs\%str%.log
call %args% > %logname% 2>&1
call d %logname%
type %logname%

make_clean_str.bat :

set str=%*
set str=%str: =_%
set str=%str::=_%
set str=%str:\=_%
set str=%str:/=_%
set str=%str:;=_%

make_clean_str is handy any time you want to make a string into a file name, it gets rid of illegal characters.

runlogged runs a command line and saves an archive of all runs, with the arg list and the date/time in the file name so they never collide.

Thanks to these two :

vanderwoude Batch files
ss64 Windows CMD Command Syntax

If the internet wasn't such a useless pile of turd, I would completely download both of their sites for my reference library.

I've written before, but repeating : what I really want is a fully journalling OS; every thing I ever run should be fully tracked and completely reproducable; all the input & output files should be saved. Oh well, this is better than nothing.

old rants