3/16/2013

03-16-13 - Writing Portable Code Rambles

Some thoughts after spending some time on this (still a newbie). How I would do it differently if I started from scratch.

1. Obviously you all know the best practice of using your own data types (S32 or whatever) and making macros for any kind of common operation that the standards don't handle well (like use a SEL macro instead of ?: , make a macro for ROT, etc). Never use bit-fields, make your own macros for manipulating bits within words. You also have to make your own whole macro meta-language for things not quite in the language, like data alignment, restrict/alias, etc. etc. (god damn C standard people, spend some time on the actual problems that real coders face every day. Thanks mkay). That's background and it's the way to go.

Make your own defines for SIZEOF_POINTER since stupid C doesn't give you any way to check sizeof() in a macro. You probably also want SIZEOF_REGISTER. You need your own equivalent of ptrdiff_t and intptr_t. Best practice is to use pointer-sized ints for all indexing of arrays and buffer sizes.

(one annoying complication is that there are platforms with 64 bit pointers on which 64-bit int math is very slow; for example they might not have a 64-bit multiply at all and have to emulate it. In that case you will want to use 32-bit ints for array access when possible; bleh)

Avoid using "wchar_t" because it is not always the same size. Try to explicitly use UTF16 or UTF32 in your code. You could make your own SIZEOF_WCHAR and select one or the other on the appropriate platform. (really try to avoid using wchar at all; just use U16 or U32 and do your own UTF encoding).

One thing I would add to the macro meta-language next time is to wrap every single function (and class) in my code. That is, instead of :


int myfunc( int args );

do

FUNC1 int FUNC2 myfunc(int args );

or even better :

FUNC( int , myfunc , (int args) );

this gives you lots of power to add attributes and other munging as may be needed later on some platforms. If I was doing this again I would use the last style, and I would have two of them, a FUNC_PUBLIC and FUNC_PRIVATE to control linkage. Probably should have separate wrapper macros for the proto and the body.

While you're at it you may as well have a preamble in every func too :


FUNC_PUBLIC_BODY( int , myfunc , (int args) )
{
    FUNC_PUBLIC_PRE

    ...
}

which lets you add automatic func tracing, profiling, logging, and so on.

I wish I had made several different layers of platform Id #defines. The first one you want is the lowest level, which explicitly Id's the current platform. These should be exclusive (no overlaps), something like OODLE_PLATFORM_X86X64_WIN32 or OODLE_PLATFORM_PS3_PPU.

Then I'd like another layer that's platform *groups*. For me the groups would probably be OODLE_PLATFORM_GROUP_PC , GROUP_CONSOLE, and GROUP_EMBEDDED. Those let you make gross characterizations like on "GROUP_PC" you use more memory and have more debug systems and such. With these mutually exclusive platform checks, you should never use an #else. That is, don't do :

#if OODLE_PLATFORM_X86X64_WIN32
.. some code ..
#else
.. fallback ..
#endif
it's much better to explicitly enumerate which platforms you want to go to which code block, and then have an
#else
#error new platform
#endif
at the end of every check. That way when you try building on new platforms that you haven't thought carefully about yet, you get nice compiler notification about all the places where you need to think "should it use this code path or should I write a new one". Fallbacks are evil! I hate fallbacks, give me errors.

Aside from the explicit platforms and groups I would have platform flags or caps which are non-mutually exclusive. Things like PLATFORM_FLAG_STDIN_CONSOLE.

While you want the raw platform checks, in end code I wish I had avoided using them explicitly, and instead converted them into logical queries about the platform. What I mean is, when you just have an "#if some platform" in the code, it doesn't make it clear why you care that's the platform, and it doesn't make it reusable. For example I have things like :

#if PLATFORM_X86X64
// .. do string matching by U64 and xor/cntlz
#else
// unaligned U64 read may be slow
// do string match byte by byte
#endif
what I should have done is to introduce an abstraction layer in the #if that makes it clear what I am checking for, like :

#if PLATFORM_X86X64
#define PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS 1
#elif PLATFORM_PS3
#define PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS 0
#else
#error classify me
#endif

#if PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS
// .. do string matching by U64 and xor/cntlz
#else
// unaligned U64 read may be slow
// do string match byte by byte
#endif

then it's really clear what you want to know and how to classify new platforms. It also lets you reuse that toggle in lots of places without code duping the fiddly bit, which is the platform classification.

Note that when doing this, it's best to make high level usage-specific switches. You might be tempted to try to use platform attributes there. Like instead of "PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS" you might want to use "PLATFORM_SWITCH_UNALIGNED_READ_PENALTY" . But that's not actually what you want to know, you want to know if on my particular application (LZ string match) it's better to use big words or not, and that might not match the low level attribute of the CPU.

It's really tempting to skip all this and abuse the switches you can see (lord knows I do it); I see (and write) lots of code that does evil things like using "#ifdef _MSC_VER" to mean something totally different like "is this x86 or x64" ? Of course that screws you when you move to another x86 platform and you aren't detecting it correctly (or when you use MSVC to make PPC or ARM compiles).

Okay, that's all pretty standard, now for the new bit :

2. I would opaque out the system APIs in two levels. I haven't actually ever done this, so grains of salt, but I'm pretty convinced it's the right way to go after working with a more standard system.

(for the record : the standard way is to make a set of wrappers that tries to behave the same on all systems, eg. that tries to hide what system you are on as much as possible. Then if you need to do platform-specific stuff you would just include the platform system headers and talk to them directly. That's what I'm saying is not good.)

In the proposed alternative, the first level would just be a wrapper on the system APIs with minimal or no behavior change. That is, it's just passing them through and standardizing naming and behavior.

At this level you are doing a few things :

2.A. Hiding the system includes from the rest of your app. System includes are often in different places, and often turn on compiler flags in nasty ways. You want to remove that variation from the rest of your code so that your main codebase only sees your own wrapper header.

2.B. Standardizing naming. For example the MSVC POSIX funcs are all named wrong; at this level you can patch that all up.

2.C. Fixing things that are slightly different or don't work on various platforms where they really should be the same. For example things like pthreads are not actually all the same on all the pthreads platforms, and that can catch you out in nasty ways. (eg. things like sem_init always failing on Mac).

Note this is *not* trying to make non-POSIX platforms look like POSIX. It's not hiding the system you're on, just wrapping it in a standard way.

2.D. I would also go ahead and add my own asserts for args and returns in this layer, because I hate functions that just return error codes when there's a catastrophic failure like a null arg or an EHEAPCORRUPT or whatever.

So once you have this wrapper you no longer call any system funcs directly from your main codebase, but you still would be doing things like :


#if PLATFORM_WIN32

    HANDLE h = platform_CreateFile( ... )

#elif PLATFORM_POSIX

    int fd = platform_open( name , flags )

#else
    #error unknown platform
#endif

that is, you're not hiding what platform you're on, you're still letting the larger codebase get to the low level calls, it's just the mess of how fucked they are that's hidden a bit.

3. You then have a second level of wrapping which tries to make same-action interfaces that dont require you to know what platform you're on. Second level is written on the first level.

The second level wrappers should be as high level as necessary to opaque out the operation. For example rather than having "make temp file name" and "open file" you might have "open file with temp name", because on some platforms that can be more efficient when you know it is a high-level combined op. You don't just have "GetTime" you have "GetTimeMonotonic" , because on some platforms they have an efficient monotonic clock for you, and on other platforms/hardwares you may have to do a lot of complicated work to ensure a reliable clock (that you don't want to do in the low level timer).

When a platform can't provide a high-level function efficiently, rather than emulate it in a complex way I'd rather just not have it - not a stub that fails, but no definition at all. That way I get a compile error and in those spots I can do something different, using the level 1 APIs.

The first level wrappers are very independent of the large code base's usage, but the second level wrappers are very much specifically designed for their usage.

To be clear about the problem of making platform-hiding second layer wrappers, consider something like OpenFile(). What are the args to that? What can it do? It's hopeless to make something that works on all platforms without greatly reducing the capabilities of some platforms. And the meaning of various options (like async, temporary, buffered, etc.) all changes with platform.

If you wanted to really make a general purpose multi-platform OpenFile you would have to use some kind of "caps" query system, where you first do something like OpenFile_QueryCaps( OF_DOES_UNBUFFERED_MEAN_ALIGNMENT_IS_REQUIRED ) and it would be an ugly disaster. (and it's obviously wrong on the face of it, because really what you're doing there is saying "is this win32" ?). The alternative to the crazy caps system is to just make the high level wrappers very limited and specific to your usage. So you could make a platform-agnostic wrapper like OpenFile_ForReadShared_StandardFlagsAndPermissions(). Then the platforms can all do slightly different things and satisfy the high level goal of the imperative in the best way for that platform.

A good second level has as few functions as possible, and they are as high level as possible. Making them very high level allows you to do different compound ops on the platform in a way that's hidden from the larger codebase.

3/10/2013

03-10-13 - Two LZ Notes

Note 1 : on rep matches.

"Rep matches" are a little weird. They help a lot, but the reason why they help depends on the file you are compressing. (rep match = repeat match, gap match, aka "last offset")

On text files, they work as interrupted matches, or "gap matches". They let you generate something like :


stand on the floor
stand in the door

stand in the door
[stand ][i][n the ][d][oor]

[off 19, len 6][1 lit][rep len 6][1 lit][off 18, len 3]

that is, you have a long match of [stand on the ] but with a gap at the 'o'.

Now, something I observed was that more than one last offset continues to help. On text the main benefit from having two last offsets is that it lets you use a match for the gap. When the gap is not just one character but a word, you might want to use a match to put that word in, in which case the continuation after the gap is no longer the first last-offset, it's the second one. eg.


cope
how to work with animals
how to cope with animals

[how to ][cope][ with animals]
[off 25 ][off 32][off 25 (rep2)]

You could imagine alternative coding structures that don't require keeping some number of "last offsets". (oddly, the last offset maintenance can be a large part of decode time, because maintaining an MTF list is something that CPUs do incredibly poorly). For example you could code with a scheme where you just send the entire long match, and then any time you send a long match you send a flag for "are there any gaps", and if so you then code some gaps inside the match.

The funny thing is, on binary files "last offsets" do something else which can be more important. They become the most common offsets. In particular, on highly structured binary data, they will generally be some factor of the structure size. eg. on a file that has a struct size of 36, and that struct has dwords and such in it, the last offsets will generally be things like 4,8,16,36, or 72. They provide a sort of dictionary of the most common offsets so that you can code those smaller. You are still getting the gap-match effect, but the common-offset benefit is much bigger on these files.

(aside : word-replacing transform on text really helps LZ (and everything) by removing the length variance of tokens. In particular for LZ77, word length variance breaks rep matches. There are lots of common occurances of a single replaced word in a phrase, like : "I want some stuff" -> "I want the stuff". You can't get a rep match here of [ stuff] because the offset changed because the substituted word was different length. If you do WRT first, then gap matches get these.)

Note 2 : on offset structure.

I've had it in the back of my head for quite some time now to do an LZ compressor specifically designed for structured data.

One idea I had was to use "2d" match offsets. That is, send a {dx,dy} where dx is within the record and dy is different records. Like imagine the data is in a table, dy is going back rows, dx is an offset on the row. You probably want to mod dx around the row so its range is always the same, and special case dy=0 (matches within your own record).

It occurred to me that the standard way of sending LZ offsets these days actually already does this. The normal way that good LZ's send offsets these days is to break it into low and high parts :

low = offset & 7F;
high = offset >> 7;
or similar, then you send "high" using some kind of "NoSB" scheme (Number of Significant Bits is entropy coded, and the bits themselves are sent raw), and you send "low" with an order-0 entropy coder.

But this is just a 2d structured record offset for a particular power-of-2 record size. It's why when I've experimented with 2d offsets I haven't seen huge wins - because I'm already doing it.

There is some win to be had from custom 2d-offsets (vs. the standard low/high bits scheme) when the record size is not a power of two.

3/06/2013

03-06-13 - Sympathy for the Library Writer

Over the years of being a coder who was a library-consumer and not a library-writer, I've done my share of griping about annoying API's or what I saw as pointless complication or ineffiency. Man, I've been humbled by my own experience trying to write a public library. It is *hard*.

The big problem with libraries is that you don't control how they're used. This is in contrast to game engines. Game engines are not libraries. I've worked on many game engines over the years, including ones that went out to large free user bases (Genesis 3d and Wild Tangent), and they are much much easier than libraries.

The difference is that game engines generally impose an architecture on the user. They force you to use it in a certain way. (this is of course why more advanced developers despise them so much; it sucks to have some 3rd party telling you your code architecture). It's totally acceptable if a game engine only works well when you use it in the approved way, and is really slow if you abuse it, or it could even crash if you use it oddly.

A library has to be flexible about how it's used; it can't impose a system on the user, like a certain threading model, or a certain memory management model, or even an error-handling style.

Personally when I do IO for games, I make a "tool path" that just uses stdio and is very simple and flexible, does streaming IO and text parsing and so on, but isn't shipped with the game, and I make a "game path" that only does large-block async IO that's pre-baked so you can just point at it. I find that system is powerful enough for my use, it's easy to write and use. It means that the "tool path" doesn't have to be particularly fast, and the fast game path doesn't need to support buffered character IO or anything other than big block reads.

But I can't force that model on clients, so I have to support all the permutations and I have to make them all decently fast.

A lot of times in the past I've complained about over-complicated APIs that have tons of crazy options that nobody ever needs (look at the IJG jpeg code for example). Well, now I see that often those complicated APIs were made because somebody (probably somebody important) needed those options. Of course as the library provider you can offer the complex interface and also simpler alternatives, but that has its own pitfalls of making the API bigger and more redundant (like if you offer OpenFileSimple and OpenFileComplex); in some ways it's better to only offer the complex API and make the user wrap it and reduce the parameter set to what they actually use.

There's also a sort of "liability" issue in libraries. Not exactly legal liability, but program bad behavior liability. Lots of things that would make the library easier to use and faster are naughty to do automatically. For example Oodle under Vista+ can run faster with elevated priviledge, to get access to some of the unsecure file APIs (like extending without zeroing), but it would be naughty for me to do that automatically, so instead I have to add an extra step to make the client specifically ask for that.

Optimization for me has really become a nightmare. At first I was trying to make every function fast, but it's impossible, there are just too many entry points and too many usage patterns. Now my philosophy is to make certain core functions fast, and then address problems in the bigger high level API as customers see issues. I remember as a game developer always being so pissed that all the GL drivers were specially optimized for Id. I would want to use the API in a slightly different style, and my way would be super slow, not for any good reason but just because it hadn't gotten the optimization loving of the important customer's use case.

I used to also rail about the "unnecessary" argument checking that all the 3d APIs do. It massively slows them down, and I would complain that I had ensured the arguments were good so just fucking pass them through, stop slowing me down with all your validation! But now I see that if you really do that, you will just constantly be crashing people as they pass in broken args. In fact arg validation is often the way that people figure out the API, either because they don't read the docs or because the docs are no good.

(this is not even getting into the issue of API design which is another area where I have been suitably humbled)

ADDENDUM : I guess I should mention the really obvious points that I didn't make.

1. One of the things that makes a public library so hard after release is that you can't refactor. The normal way I make APIs for myself (and for internal teams) is to sort of make an effort at a good API the first time, but it usually sucks, and you rip it out and go through big scourges of find-rep. That only works when you control all the code, the library and the consumer. It's only after several iterations that the API becomes really nice (and even then it's only nice for that specific use case, it might still suck in the wild).

2. APIs without users almost always suck. When someone goes away in a cave and works on a big new fancy library and then shows it to the world, it's probably terrible. This a problem that I think everyone at RAD faces. The code of mine that I really like is stuff that I use over and over, so that I see the flaws and when I want it to be easier to use I just go fix it.

3. There are two separate issues about what makes an API "good". One is "is it good for the user?" and one is "is it good for the library maintainer?". Often they are the same but not always.

Anyway, the main point of this post is supposed to be : the next time you complain about a bad library design, there may well be valid reasons why it is the way it is; they have to balance a lot of competing goals. And even if they got it wrong, hey it's hard.

3/01/2013

03-01-13 - Zopfli

zopfli seems to make small zips. There's no description of the algorithm so I can't comment on it. But hey, if you want small zips it seems to be the current leader.

(update : I've had a little look, and it seems to be pretty straightforward, it's an optimal parser + huff reset searcher. There are various other prior tools to do this (kzip,deflopt,defluff,etc). It's missing some of the things that I've written about before here, such as methods of dealing with the huff-parse feedback; the code looks pretty clean, so if you want a good zip-encoder code it looks like a good place to start.)

I've written these things before, but I will summarize here how to make small zips :

1. Use an exact (windowed) string matcher.

cbloom rants 09-24-12 - LZ String Matcher Decision Tree

2. Optimal parser. Optimal parsing zip is super easy because it has no "repeat match", so you can use plain old backwards scan. You do have the huffman code costs, so you have to consider at least one match candidate for each codeword length.

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 09-04-12 - LZ4 Optimal Parse

3. Deep huffman reset search. You can do this pretty easily by using some heuristics to set candidate points and then building a bottom-up tree. Zopfli seems to use a top-down greedy search. More huffman resets makes decode slower, so a good encoder should expose some kind space-speed tradeoff parameter (and/or a maximum number of resets).

cbloom rants 06-15-10 - Top down - Bottom up
cbloom rants 10-02-12 - Small note on Adaptive vs Static Modeling

4. Multi-parse. The optimal parser needs to be seeded in some way, with either initial code costs or some kind of heuristic parse. There may be multiple local minima, so the right way to do it is to run 4 seeds (or so) simultaneously with different strategies.

cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies

5. The only unsolved bit : huffman - parse feedback. The only solution I know to this is iteration. You should use some tricks like smoothing and better handling of the zero-frequency symbols, but it's just heuristics and iteration.


One cool thing to have would be a cheap way to compute incremental huffman cost.

That is, say you have some array of symbols. The symbols have a corresponding histogram and huffman code. The full huffman cost is :

fullcost(symbol set) = cost{ transmit code lengths } + sum[n] { codelen[n] * count[n] }
that is, the cost to send the code lengths + the cost of sending all the symbols with those code lengths.

You'd like to be able to do an incremental update of fullcost. That is, if you add one more symbol to the set, what is the delta of fullcost ?

*if* the huffman code lengths don't change, then the delta is just +codelen[symbol].

But, the addition of the symbol might change the code lengths, which causes fullcost to change in several ways.

I'm not sure if there's some clever fast way to do incremental updates; like when adding the symbol pushes you over the threshold to change the huffman tree, it often only changes some small local part of the tree, so you don't have to re-sum your whole histogram, just the changed part. Then you could slide your partition point across an array and find the optimal point quite quickly.


Now some ranting.

How sad is it that we're still using zip?

I've been thinking about writing my own super-zipper for many years, but I always stop myself because WTF is the point? I don't mean for the world, I guess I see that it is useful for some people, but it does nothing for *me*. Hey I could write some thing and probably no one would use it and I wouldn't get any reward from it and it would just be another depressing waste of some great effort like so many other things in my life.

It's weird to me that the best code in the world tends to be the type of code that's given away for free. The little nuggets of pure genius, the code that really has something special in it - that tends to be the free code. I'm thinking of compressors, hashers, data structures, the real gems. Now, I'm all for free code and sharing knowledge and so on, but it's not equal. We (the producers of those gems) are getting fucked on the deal. Apple and the financial service industry are gouging me in every possible immoral way, and I'm giving away the best work of my life for nothing. It's a sucker move, but it's too late. The only sensible play in a realpolitik sense of your own life optimization is to not work in algorithms.

Obviously anyone who claims that patents provide money to inventors is either a liar (Myhrvold etc.) or just has no familiarity with actual engineering. I often think about LZ77 as a case in point. The people who made money off LZ77 patents were PK and Stac, both of whom contributed *absolutely nothing*. Their variants were completely trivial obvious extensions of the original idea. Of course the real inventors (L&Z, and the modern variant is really due to S&S) didn't patent and got nothing. Same thing with GIF and LZW, etc. etc. perhaps v42 goes in there somewhere too; not a single one of the compression-patent money makers was an innovator. (and this is even igoring the larger anti-patent argument, which is that things like LZ77 would have been obvious to any number of researchers in the field at the time; it's almost always impossible to attribute scientific invention/discovery to an individual)

2/23/2013

02-23-13 - Threading Patterns - Wake Polling

Something I've written about a lot but never given a solid name to.

When a thread is waiting on some condition, your goal should be to only wake it up if that condition is actually true - that is, the thread really gets to run. In reverse order of badness :

1. Wakeup condition polling. This is the worst and is very common. You're essentially just using the thread wakeup to say "hey your condition *might* be set, check it yourself". The suspect code looks something like :


while( ! condition )
{
    Wait(event);
}

these threads can waste a ton of cycles just waking up, checking their condition, then going back to sleep.

One of the common ways to get nasty wake-polling is when you are trying to just wake one thread, but you have to do a broadcast due to the possibility of a missed wakeup (as in the naive semaphore from waitset ).

Of course any usage of cond_var is a wake-poll loop. I really don't like cond_var as an API or a programming pattern. It encourages you to write wakee-side condition checks. Whenever possible, waker-side condition checks are better. (See previous notes on cond vars such as : In general, you should prefer to use the CV to mean "this condition is set" , not "hey wakeup and check your condition").

(ADDENDUM : in fact I dislike cond_var so much I wrote a proposal on an alternative cond_var api ).

Now it's worth breaking this into two sub-categories :

1.A. Wake-polling when it is extremely likely that you get to run immediately.

This is super standard and is not that bad. At root, what's happening here is that under normal conditions, the wakeup means the condition is true and you get to run. The loop is only needed to catch the race where someone stole your wakeup.

For example, the way Linux implements semaphore on futex is a classic wake-poll. The core loop is :


    for(;;)
    {
        if ( try_wait() ) break;

        futex_wait( & sem->value, 0 ); // wait if sem value == 0
    }

If there's no contention, you wake from the wait and get to try_wait (dec the count) and proceed. The only time you have to loop is if someone else raced in and dec'ed the count before you. (see also in that same post a discussion of why you actually *want* that race to happen, for performance reasons).

The reason this is okay is because the futex semaphore only has to do a wake 1 when it signals. If it had to do a broadcast, this would be a bad loop. (and note that the reason it can do a broadcast is due to the special nature of the futex wait, which ensures that the single thread signal actually goes to someone who needs it!) (see : cbloom rants 08-01-11 - Double checked wait ).

1.B. Wake-polling when it is unlikely that you get to run.

This is the really bad one.

As I've noted previously ( cbloom rants 07-26-11 - Implementing Event WFMO ) this is a common way for people to implement WFMO. The crap implementation basically looks like this :


while ( any events in array[] not set )
{
    wait on an unset event in array[]
}

What this does is any time one of the events in the set triggers, it wakes up all the waiters that are waiting on it in an array, checks the array, and they go back to sleep.

Obviously this is terrible, it causes bad "thread thrashing" - tons of wakeups and immediate sleeps just to get one thread to eventually run.

2. "Direct Handoff" - minimal wakes. This is the ideal; you only wake a thread when you absolutely know it gets to run.

When only a single thread is waiting on the condition, this is pretty easy, because there's no issue of "stolen wakeups". With multiple threads waiting, this can be tricky.

The only way to really robustly ensure that you have direct handoff is by making the wakeup ensure the condition.

At the low level, you want threading primitives that don't give you unnecessary wakeups. eg. we don't like the pthreads cond_var that has you call :

    condvar.wait();
    mutex.lock();
as two separate calls, which means you can wake from the condvar and immediately fail to get the mutex and go back to sleep. Prefer a single call :
    condvar.wait_then_lock(mutex);
which only wakes you when you get a cv signal *and* can acquire the mutex.

At the high level, the main thing you should be doing is *waker* side checks.

eg. to do a good WFMO you should be checking for all-events-set on the *waker* side. To do this you must create a proxy event for the set when you enter the wait, register all the events on the proxy, and then you only signal the proxy when they are all set. When one of them is set, it does the checking. That is, the checking is moved to the signaller. The advantage is that the signalling thread is already running.

02-23-13 - Threading APIs that would be ideal

If you were writing an OS from scratch right now, what low level threading primitives should you provide? I contend they are rather different than the norm.

1. A low-level keyed event with double-checked wait.

Futex and NT's keyed event are both pretty great, but the ideal low level wait should be double-checked. I believe it should be something like :


HANDLE Waitset;

Waitset CreateWaitset();
DestroyWaitset(Waitset ws);

HANDLE wait_handle = Waitset_PrepareWait( Waitset ws , U64 key );

Waitset_CancelWait( Waitset ws , wait_handle h );
Waitset_Wait( Waitset ws , wait_handle h );

Waitset_Signal( Waitset ws, U64 key );

**Now, key of course could be a pointer, but there's no reason for it to be particularly. This is easily a superset of futex; if you want you could just have one global Waitset object, and key could be an int pointer, and you could check *ptr in between PrepareWait and Wait, that would give you futex. But you can do much more with this.

I prefer having a "waitset" object to put the waits on (like KeyedEvent), not just making it global/static (like futex). The advantage is 1. efficiency and 2. multiple meanings for a single "key". It's more efficient because you can have different waitsets for different uses, which makes each one cover fewer waits, which makes all the lookups faster. (that is, rather than 100 global waits pending, maybe you have 10 on 10 different waitsets). The other advantage is that you can reuse the same value for key without it confusing the system. You could have one Waitset where key is a pointer, and another where key is an internal handle number, etc.

2. A proper cond_var with waker-side condition checking.

First of all, a decent cond_var API combines a lot of the disjoint junk in the posix API. It should include the mutex, because that allows for vastly more efficient implementation :


    class condition_var
    {
    public:
        void lock();
        void unlock();
    
        // the below are always called with lock held :

        void unlock_wait_lock();
        
        void signal_unlock();
        void broadcast_unlock();

    private:
        ...
    };

The basic usage of this cv is like :

    cv.lock();

    while( ! condition )
    {
        cv.unlock_wait_lock();
    }

    .. do stuff with condition true ..

    cv.unlock();

A good implementation should do the compound ops (signal_unlock, etc) atomically. But I wouldn't require that because it's too hard.

But that's just background. What you really want is to put the condition check in the API. It should be :


        void wait_lock( [] { wake condition } );

The spec of the API is that "wake condition" is some code that will be run with the mutex locked, and when the function exits you will own the mutex and the condition is true. Then client usage is like :

    cv.wait_lock( condition );

    .. do stuff with condition true ..

    cv.unlock();

which allows for much more efficient implementation. The wake condition of the waiter list can be evaluated easily inside signal_unlock(), because that's always called with the mutex held.

02-23-13 - Threading - Reasoning Behind Coroutine Centric Design

cbloom rants 12-21-12 - Coroutine-centric Architecture is a proposed architecture.

Why do I think it should be that way? Let's revisit some points.

1. Main thread should be a worker and all workers should be symmetric. That is, there's only one type of thread - worker threads, and all functions are work items. There are no special-purpose threads.

The purpose of this is to minimize thread switches, and to make waits be immediate runs whenever possible.

Consider the alternative. Say you have a classic "main" thread and a worker thread. Your main thread is running along and then decides it has to Wait() on a work item. It has to sleep the thread pending a notification from the worker thread. The OS has to switch to the worker, run the job, notify, then switch you back.

With fully symmetric threads, there is no actual thread wait there. If the work item is not started, or is in a yield point of a coroutine, you simply pop it and run it immediately. (of course your main() also has to be a coroutine, so that it can be yielded out at that spot to run the work item). Symmetric threads = less thread switching.

There are other advantages. One is that you're less affected by the OS starving one of your threads. When your threads are not symmetric, if one is starved (and is the bottleneck) it can ruin your throughput; one crucial job or IO can't run and then all the other threads back up. With symmetric threads, someone else grabs that job and off you go.

Symmetric threads are self-balancing. Any time you decide "we have 2 threads for graphics and 1 for compute" you are assuming you know your load exactly, and you can only be wrong. Symmetric threads maximize the utilization of the cpu. (Note that for cache coherence you might want to have a system that *prefers* to keep the same time of job on the same thread, but that's only a soft preference and it will run other jobs if there are none of the right type).

Symmetric threads scale cleanly down to 1. This is a big one that I think is important. Even just for debugging purposes, you want to be able to run your system non-threaded. With asymmetric threads you have to have a custom "non-threaded" pathway, which leads to bugs and means you aren't testing the same threaded pathway. The symmetric thread system scales down to 1 thread using the same code as always - when you wait on a job, if it hasn't been started it's just run immediately.

It's also much easier to have deadlocks in asymmetric thread systems. If an IO job waits on a graphics job, and a graphics job waits on an IO job, you're in a tricky situation; of course you shouldn't deadlock as long as there are no circular dependencies, but if one of those threads is processing in FIFO order you can get in trouble. It's just better to have a system where that issue doesn't even arise.

2. Deep yield.

Obviously if you want to write real software, you can't be returning out to the root level of the coroutine every time you want to yield.

In the full coroutine-centric architecture, all the OS waits (mutex locks, etc) should be coroutine yields. The only way to do that is if they can call yield() internally and it's a full stack-staving deep yield.

Of course you should be able to spawn more coroutines from inside your coroutine, and wait on them (with that wait being a yield). That is, aside from the outer branch-merge, each internal operation should be able to do its own branch-merge, and yield its thread to its sub-items.

3. Everything GC.

This is just the only reasonable way to write code in this system. It gives you a race-free way to ensure that object lifetimes exceed their usage.

The last post I did about the simple string crash is just so easy to do. The problem is that without GC you inevitably try to be "clever" and "efficient" (really "dumb" and "pointless") about your lifetime management. That is, you'll write things like :


void func1()
{
char name[256];
.. file name ..

Handle h = StartJob( LoadAndDecompress, name );

...

Wait(h);
}

which is okay, because it waits on the async op inside the lifetime of "name". But of course a week later you change this function to :

Handle func1()
{
char name[256];
.. file name ..

Handle h = StartJob( LoadAndDecompress, name );

...

return h;
}

with the wait done externally, and now it's a crash. Manual lifetime management in heavily-threaded code is just not reasonable.

The other compelling reason is that you want to be able to have "dangling" coroutines, that is you don't want to have to wait on them and clean them up on the outside, just fire them off and the clean themselves when they finish. This requires some kind of ref-counted or GC'ed ownership of all objects.

4. A thread per core.

With all your "threading" as coroutines and all your waits as "yields", you no longer need threads to share the cpu time, so you just make one thread per core and leave it there.

I wanted to note an exception to this - OS signals that cannot be converted to yields, such as IO. In this case you still need to do a true OS Wait that would block a thread. This would stop your entire worker from running, so that's not nice.

The solution is to have a separate pool of threads for running the small set of OS functions that do internal thread waits. That is, you convert :


ReadFile( args )

->

yield RunOnThreadPool( ReadFile, args );

this separate pool of threads is in addition to the one per core (or it could just be all one pool, and you make new threads as needed to ensure that #cores of them are running).

2/18/2013

02-18-13 - The Myth of Future Value

I believe that most people (aka me) grossly overvalue future rewards when weighing the merits of various options.

I've been thinking about this a lot over the last fews days, and have come to it simultaneously from several different angles.

For the past month or so I've been going over my finances, reviewing my spending, because I'm not happy with the amount I'm saving, and I'm trying to figure out where the money is leaking. Obviously there are big expenses like cars and vacations, but those I've budgeted for, they're not the leak (*) (**), but there's still a large general money leakage that I want to track down. It turns out a lot of it is buying stuff for the house or for productivity, stuff that on its own I can justify, but overall adds up to a big waste.

A lot of that waste are things that I tell myself will "pay off someday". Like I need some rope for around the house; hey look it's a much better deal if I buy it in a 500 foot spool. I'll use it eventually so that's the better buy. Or, I need to set a bolt in concrete; sure a hammer drill is expensive, but I'll use it the rest of my life, so it will be a good value some day (better than renting one for this one job). etc. Lots of stuff where the idea is that in the long term it will be a good value.

Now I certainly haven't hit the "long term" yet, but I can already see the flaw in that logic. There are lots of reasons why that imagined long term value might never come. I might never wind up using the stuff. It might get damaged over time from sitting, or flood or who knows what. I also essentially pay a tax to store it, having stuff is not free. I pay a tax on it any time I move. Maybe I won't want to do DIY in the future and will just hire out the jobs and so won't use it. There are a lot of costs and uncertainty about this future value which make it much less valuable than it naively appears.

Perhaps computer stuff is an even easier example; like I sort of need a USB hub; I could live without it and just unplug stuff to make room depending on what device I want to use at the moment. You could easily convince yourself that the value is high because "even if I don't really need it now I'll use it someday". But of course there's any number of reasons why you might not use it some day.

(* = aside : expensive cars actually aren't that expensive; if you're careful about how you buy and sell, and look for cars that are on a pretty flat part of the depreciation curve, you can get a "$100k car" that actually only costs $5k a year. That's not really a big expense in the scheme of things. However that also doesn't mean it's free; the big cost is the time spent buying and selling; if you actually want it to be low cost you have to spend a lot of time on the transaction to get good value, and for people like me that's excruciatingly painful; for people who like wheeling-and-dealing, they can do pretty well, getting almost free stuff that they are just holding temporarily between sales)

(** = more aside, and actually there is a spending leak that I have that's associated to cars and vacations; I, like most, and perhaps less than average, fall prey to the sunk cost fallacy. The sunk cost fallacy is the idea that you've spent a bunch on something, you should stick with it and spend some more. Like I've spent this much to go on vacation, I shouldn't cheap out on the dining or whatever. Or I've got an expensive car, I should buy the expensive tires. But that of course is not true. Each decision should be evaluated independently for its value; the fact that you have a large sunk cost only matters in that it changes your current situation. You don't keep chasing your flush just because you're already called some big bets (though obviously your past calls do affect the pot size which affects your current decision)).

Of course home improvements are a classic of false future value. I'm not foolish enough to think I'll get any resale value benefit, but I do fall prey to thinking "I'll enjoy this for many years" when in fact I might not.

I was thinking about buying a really good mattress that's supposed to last 30 years vs one that will only last 5. In theory the long-life one is a much better deal, but there are any number of reasons why that might not be the case. It might not last like it's supposed to; you might pee and poo on it; you might want a different size mattress. By making an "investment" what you've done is commit yourself to something, you've removed flexibility, which is a cost.

Of course if you ever decide you want to travel the world and live in apartments again, all the buying of stuff is a big liability.

Getting away from just "accumulating stuff" now :

I've been thinking lately about my career arc. All through my young-adulthood I was carefully building up my value as a software development employee. I was improving my skills, improving my profile, networking, all that stuff, going up the career ladder. During that time I was not getting paid particularly well. I took jobs based on them being good opportunities for my larger career, not for their immediate financial reward.

The problem is that the big payoff never came. When Oddworld went under I was at the point where I could have moved on to CTO-level jobs at major studios, but I decided I didn't want to do that. The stress was ruining my life (and various other things that I've blogged about back then). The point is that this "future value" I had been building suddenly became zero. If you actually want to redeem that future value, you are locking yourself in to a path, which is a major cost you are paying, giving up flexibility in your life. And in careers there are so many factors outside your control; perhaps your specialty will become less prominent in a few years. Lots of people have done things like getting a JD only to find the law job market has dried up by the time they graduate.

Saving money in general is questionable now. The governments of the world have demonstrated that they don't care about the integrity of the world financial systems, so socking money away for the future has immense risk associated with it. (I don't put much credence in the complete currency collapse alarmists, but I do believe that a long period of negative inflation-adjusted returns is very likely). In the old days we glorified the good salaryman, who worked hard and saved some money, putting the joy of today aside to build a life for themselves and their family tomorrow.

Of course we can relate this all to poker, in old skool cbloom-rants style.

One of the first big realizations I had on my own as I was getting better and moving beyond book TAG play is that implied odds are massively overvalued by most players. "implied odds" is the term used for the imagined future value that you will get if you hit a big hand. Like if I call with a flush draw, it might be a bad value based on the immediate odds, but if I hit I'll make some more money, which makes it worth it the call. The problem is that there are a wide variety of reasons why you might not get paid off even if your flush comes (scare cards, or your opponent never had a strong calling hand to begin with). Or your flush might come and he might have a better flush (negative implied odds). If you realistically weight all those undesirable outcomes, the result is that the true effect of implied odds is very small. eg. on the turn you have a 16% chance to improve, you can call a bet for zero EV if the bet is about 20% of pot size. The action of implied odds is very small; you can only call a bet that's maybe 25% of pot size; really not much more. Certainly not the 30-35% that people talk themselves into believing is correct. (and of course in no limit holdem you have to adjust for position; out of position you should consider your implied odds to be zero, chasing a draw out of position is so very bad). What I'm getting at is the imagined future value of your current investment is far lower than you imagine.

(sort of not related but "implied odds" is also a good example of the "rationalization trap". Whenever a complicated logical exercise justifies behavior that your naughty irresponsible side secretly wants to do (like chasing draws), you should be very skeptical of that logic. Whenever you read that "a little red wine is healthy" you should be very skeptical. Whenever the result of a "study" is exactly what people want to hear, beware).

This isn't really related to the "future value" mistake, but I've been mulling another spending fallacy, which is the "value of an hour" fallacy. Sometimes I'll do something like buy a tool or hire a helper because it only costs $50 and saves me an hour of work; my hour is worth more than $50, so that's a good deal, right? I'm not so sure. I feel like that line of reasoning is just a way of rationalizing more spending, but I haven't quite found the flaw in it yet.

02-18-13 - Don't write spaghetti

It never works out. I usually even warn myself about it (by writing comments to myself), but it still catches me out. I also usually give myself a todo item like "hmm this smells funny revisit it", but of course the todos just pile up in a never-ending heap, and little old ones like that get burried.


void DoLZDecompress(const char *filename,...)
{
    struct CommandInfo i;
    i.data = (void *)filename;
    // warning : passing string pointer (not copying) to another thread, make sure it's const / sticks around!
    StartJob( &i );
}

Yup, that's a crash.


void OodleIOQ_SetKickImmediate( bool kick );
/* kick state is global ; hmm should really be per-thread ; makes it a race
*/

Yup, that's a problem, which leads to the later deadlock :

void Oodle_Wait( Handle h )
{
    // @@ ? can this handle depend on un-kicked items, and hence never complete ?
    //  I used to check for that with normal deps but it's hard now with the "permits"
    ...
}

Coding crime doesn't pay. Spaghetti always gets you in the end, with its buggy staining sauce.

Whenever I have one of those "hmm this smells funny, I'm worried about the robustness of this" , yep, it's a problem.

One of my mortal enemies are the "don't worry about it, it'll be fine" people. No it will fucking not be fine. You know what will happen? It'll be a nasty god damn race bug, which I will wind up fixing while the "don't worry about it, it'll be fine" guy is watching lolcatz or browsing facebook.

2/16/2013

02-16-13 - The Reflection Visitor Pattern

Recent blog post by Maciej ( Practical C++ RTTI for games ) set me to thinking about the old "Reflect" visitor pattern.

"Reflect" is in my opinion clearly the best way to do member-enumeration in C++. And yet almost nobody uses it. A quick reminder : the reflection visitor pattern is that every class provides a member function named Reflect which takes a templated functor visitor and applies that visitor to all its members; something like :


class Thingy
{
type1 m_x;
type2 m_y;

template <typename functor>
void Reflect( functor visit )
{
    // (for all members)
    visit(m_x);
    visit(m_y);
}

};

with Reflect you can efficiently generate text IO, binary IO, tweak variable GUIs, etc.

(actually instead of directly calling "visit" you probably want to use a macro like #define VISIT(x) visit(x,#x))

A typical visitor is something like a "ReadFromText" functor. You specialize ReadFromText for the basic types (int, float), and for any type that doesn't have a specialization, you assume it's a class and call Reflect on it. That is, the fallback specialization for every visitor should be :


struct ReadFromText
{
    template <typename visiting>
    void operator () ( visiting & v )
    {
        v.Reflect( *this );
    }
}:

The standard alternative is to use some macros to mark up your variables and create a walkable set of extra data on the side. That is much worse in many ways, I contend. You have to maintain a whole type ID system, you have to have virtuals for each type of class IO (note that the Reflect pattern uses no virtuals). The Reflect method lets you use the compiler to create specializations, and get decent error messages when you try to use new visitors or new types that don't have the correct handlers.

Perhaps the best thing about the Reflect system is that it's code, not data. That means you can add arbitrary special case code directly where it's needed, rather than trying to make the macro-cvar system handle everything.

Of course you can go farther and auto-generate your Reflect function, but in my experience manual maintenance is really not a bad problem. See previous notes :

cbloom rants 04-11-07 - 1 - Reflection
cbloom rants 03-13-09 - Automatic Prefs
cbloom rants 05-05-09 - AutoReflect

Now, despite being pro-Reflect I thought I would look at some of the drawbacks.

1. Everything in headers. This is the standard C++ problem. If you truly want to be able to Reflect any class with any visitor, everything has to be in headers. That's annoying enough that in practice in a large code base you probably want to restrict to just a few types of visitor (perhaps just BinIO,TextIO), and provide non-template accessors for those.

This is a transformation that the compiler could do for you if C++ was actually well designed and friendly to programmers (grumble grumble). That is, we have something like

template <typename functor>
void Reflect( functor visit );
but we don't want to eat all that pain, so we tell the compiler which types can actually ever visit us :
void Reflect( TextIO & visit );
void Reflect( BinIO & visit );
and then you can put all the details in the body. Since C++ won't do it for you, you have to do this by hand, and it's annoying boiler-plate, but could be made easier with macros or autogen.

2. No virtual templates in C++. To call the derived-class implementation of Reflect you need to get down there in some ugly way. If you are specializing to just a few possible visitors (as above), then you can just make those virtual functions and it's no problem. Otherwise you need a derived-class dispatcher (see cblib and previous discussions).

3. Versioning. First of all, versioning in this system is not really any better or worse than versioning in any other system. I've always found automatic versioning systems to be more trouble than they're worth. The fundamental issue is that you want to be able to incrementally do the version transition (you should still be able to load old versions during development), so the objects need to know how to load old versions and convert them to new versions. So you wind up having to write custom code to adapt the old variables to the new, stuff like :


if ( version == 1 )
{
    // used to have member m_angle
    double m_angle;
    visitor(m_angle);
    m_angleCos = cos(m_angle);
}
else
{
    visitor(m_angleCos);
}

now, you can of course do this without explicit version numbers, which is my preference for simple changes. eg. when I have some text prefs and decide I want to remove some values and add new ones, you can just leave code in to handle both ways for a while :

{

#ifndef FINAL
if ( visitor.IsRead() )
{
    double m_angle = 0;
    visitor(m_angle);
    m_angleCos = cos(m_angle);
}
#endif

visitor(m_angleCos);

}

where I'm using the assumption that my IO visitor is a NOP on variables that aren't in the stream. (eg. when loading an old stream, m_angleCos won't be found and so the value from m_angle will be loaded, and when loading a new stream the initial filling from m_angle will be replaced by the later load from m_angleCos).

Anyway, the need for conversions like this has always put me off automatic versioning. But that also means that you can't use the auto-gen'ed reflection. I suspect that in large real-world code, you would wind up doing lots of little special case hacks which would prevent use of the simple auto-gen'ed reflection.

4. Note that macro-markup and Reflect() both could provide extra data, such as min & max value ranges, version numbers, etc. So that's not a reason to prefer one or the other.

5. Reflect() can be abused to call the visitor on values that are on the stack or otherwise not actually data members. Mostly that's a big advantage, it lets you do converions, and also serialize in a more human-friendly format (for text or tweakers) (eg. you might store a quaternion, but expose it to tweak/text prefs as euler angles) (*).

But, in theory with a macro-markup cvar method, you could use that for layout info of your objects, which would allow you to do more efficient binary IO (eg. by identifying big blocks of data that can be read in binary without any conversions).

(* = whenever you expose a converted version, you should also store the original form in binary so that write-then-read is a gauranteed nop ; this is of course true even for just floating point numbers that aren't printed to all their places, which is something I've talked about before).

I think this potential advantage of the cvar method is not a real advantage. Doing super-efficient binary IO should be as close to this :


void * data = Load( one file );
GameWorld * world = (GameWorld *) data;

as possible. That's going to require a whole different pathway for IO that's separate from the cvar/reflect pathway, so there's no need to consider that as part of the pro/con.

6. The End. I've never used the Reflect() pattern in the real world of a large production codebase, so I don't know how it would really fare. I'd like to try it.

old rants