9/04/2012

09-04-12 - Encoding Values in Bytes Part 4

Just some random addenda. The main post series is here :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3

1. I mentioned prefix byte length codes in the first post. This is a code in which you first send the # of bytes using a bit prefix code, then the remainder. A small note on that :

Obviously you have unary :

1 + 7 bits
10 + 14 bits
110 + 21 bits
1110 + 28 bits

If the max you send is 4 bytes then we're actually wasting a bit on the last code; it should just be 111 + 29 bits :

1 + 7 bits
10 + 14 bits
110 + 21 bits
111 + 29 bits

and obviously you have the fixed two bit code :

00 + 6 bits
01 + 14 bits
10 + 22 bits
11 + 30 bits

What I didn't mention before is that for a max of 4 bytes these are the *only* prefix codes. All other prefix codes are just the xor of these or are strictly worse.

2. The unary prefix code is always nice, because it can be decoded branchlessly with a bit scan (count leading/trailing zeros ; BSF/BSR). (see for example Elias Gamma or Exp-Golomb coding).

The decoder for mod-128 (with the flag bits at the head, and little endian) is :


const uint32 mod128_mask[] = { (1UL<<7)-1, (1UL<<14)-1, (1UL<<21)-1, (1UL<<29)-1 };
const uint32 mod128_base[] = { 0, (1UL<<7), (1UL<<14) + (1UL<<7), 
                                (1UL<<21) + (1UL<<14) + (1UL<<7) };

int DecodeMod128LEBranchless(U8PC & from)
{
    uint32 dw = *( (const uint32 *) from ); // unaligned little-endian read
    unsigned long index;
    _BitScanForward(&index,dw);
    index = MIN(index,3);
    int count = index+1;
    int shift = MIN(count,3);
    from += count;
    dw >>= shift;
    dw &= mod128_mask[index];
    dw += mod128_base[index];
    return (int) dw;
}

and the encoder is left as an exercise for the reader. (as is an endian-independent version)

The code above provides the maximum code space; it adds the base of each range and it doesn't waste a bit for the top range. Obviously it can be faster (no MIN's) if you do waste the bit.
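For reference, here is one possible matching encoder (the original leaves it as an exercise). This is a sketch, not the original code; it assumes the contiguous bases 0, 2^7, 2^7+2^14, 2^7+2^14+2^21 and standard stdint types :

```cpp
#include <stdint.h>

// Possible encoder for the mod-128 little-endian layout decoded above.
// Writes 1-4 bytes; the flag bits (LSB-first) are 1, 01, 001, 000.
// Returns the number of bytes written.
int EncodeMod128LE(uint8_t * to, uint32_t val)
{
    static const uint32_t base[4] = { 0, 1u<<7, (1u<<7)+(1u<<14),
                                      (1u<<7)+(1u<<14)+(1u<<21) };
    int count;
    if      ( val < base[1] ) count = 1;
    else if ( val < base[2] ) count = 2;
    else if ( val < base[3] ) count = 3;
    else                      count = 4;
    val -= base[count-1];
    int shift = (count < 4) ? count : 3;                  // the 4-byte code spends only 3 flag bits
    uint32_t flags = (count < 4) ? (1u << (count-1)) : 0; // unary stop bit, or all-zero for 4 bytes
    uint32_t dw = (val << shift) | flags;
    for (int i=0;i<count;i++)
        to[i] = (uint8_t)(dw >> (i*8));                   // little endian
    return count;
}
```

(Note a 4-byte code whose payload happens to be zero makes the decoder's bit scan see dw == 0, so in practice you'd special-case or bias that.)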

3. I was playing with LZ4 offsets and it's an obvious place to use EncodeMod.

Say you have an existing LZ codec that writes offsets in 2 bytes. You want to leave it alone except for the offset IO.

You can add > 65536 offset support by changing the offset to EncodeMod.

One obvious scheme is : 0 + 15 bit offsets, 1 + 23 bit offsets. This is EncodeMod( 15 bits ) on the first word followed by a raw byte (EncodeMod(0)).

But that is not ideal; you want more of the offsets in the 32768 to 65536 range to get into the first 2 bytes. Well, just lower your mod. EncodeMod( 14 bits ) gives you up to (32768+16384) in the first 2 bytes, and 22 bits for longer offsets. EncodeMod( 13 bits ) gives you up to 57344 (65536-8192) in the first 2 bytes, and 21 bits for long offsets.

The next thing you could do is be able to send a few low offsets in just 1 byte. The most important one is actually just the RLE offset (offset=1). We do not want to use something like 1 bit + 7 bit payload, because that puts too many offsets in 1 byte and it takes away too much of our 2 byte range (we still want as many offsets as possible to be sent in 2 bytes). We can get that using EncodeMod. Instead of outputting 2 bytes at first, we do 1 byte with a high mod - something like mod 254. This reserves just a few values to fit in 1 byte, and most of that byte's range is left over for the second byte.

So to send an LZ offset you might do : mod 255 , mod 64, mod 0 ; then the thresholds are

1,48961,4226881
for 1-3 byte offset encoding.

(I don't suggest you actually do this for LZ offsets; it's intended as an example of the kind of thinking that goes into good code design).
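The thresholds quoted above can be computed mechanically from the mod sequence: each step k terminates (256 - mod_k) values, and each continuation multiplies the remaining range by mod_k. A sketch (Threshold is a hypothetical helper, not from the original) :

```cpp
#include <stdint.h>

// First value that needs more than n bytes for a chained EncodeMod
// sequence: step i terminates upper_i = 256 - mods[i] values, and each
// continuation multiplies the reachable range by mods[i].
uint64_t Threshold(const int * mods, int n)
{
    uint64_t t = 0, scale = 1;
    for (int i=0;i<n;i++)
    {
        t += scale * (uint64_t)(256 - mods[i]); // terminal values at this depth
        scale *= (uint64_t)mods[i];             // continuation slots multiply the range
    }
    return t;
}
```

For mods {255, 64, 0} this reproduces the thresholds 1, 48961, 4226881 above.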

4. File sizes are a common case for this, so I did a brute force optimize.

I gathered all the file sizes on my computer. (you may question whether my computer is a representative sample).

(BTW reminder to self : beware circular symbolic links in dir traversals; lots of old software that does dir traversals can infinite loop on them these days)

Anyhoo I tried mod byte-by-byte encoding and optimized the parameters for a 3-step series. The optimal was :

B-B-B-mod :
best : 251, 27, 15 : 1228288
cost per :  2.129383

You can see it has reserved a little space for tiny file sizes to go in 1 byte, then most go in 2 bytes, and not much space is reserved for the very large rare files.

More efficient is an encodemod which goes word-then-bytes , and uses only pow2 mods. The optimal for that is :

W-B-pow2
best : 13 , 4 : 1232113
cost per :  2.136096

Only barely higher average cost. (13 and 4 are bit counts; eg. mods of 1<<13 and 1<<4)

Compare to a simple scheme I've used in the past :

write 16 bits if size < (1<<15) , else 32 bits if size < (1<<31) , else 64 bits :

16-32-64 : cost per : 2.33855

This is not a great scheme and yet it's only slightly worse on average; clearly there's not a lot of win to be had on this front.

9/02/2012

09-02-12 - Encoding Values in Bytes Part 3

Okay, now that we have some background we can bring together the two ends of the story : encoding variable length numbers, and encodings in bytes using variable ranges.

What we want to do should be obvious. We want to use some amount of range, like [0 , (T-1)] to send a value <= (T-1) immediately, and use the rest of the range [T, base-1] to send some part of a value that's >= T. (eg. base = 256 for bytewise encoding)

With our background this should be pretty obvious so I'll jump straight to the code. The trick is in the case that it doesn't all fit in the current byte, we'll use modulo to output part of the value in the space we're given.

First a recursive implementation which is more obviously symmetric between the encoder and decoder :

(In this implementation I put the "can send immediately" in the top part of the word, and I put the "here's part of the value but there's more to follow" in the bottom part of the word, which is the opposite of the description above).


typedef uint8 U8;
typedef uint8 * U8P;
typedef const uint8 * U8PC;

U8 * EncodeMod(U8 * to, int val, int mod)
{
    ASSERT( mod >= 1 && mod <= 255 );
    const int upper = 256 - mod;
    if ( val < upper )
    {
        *to++ = (U8)(mod + val);
        return to;
    }
    else
    {
        // val >= upper
        val -= upper;
        int top    = val / mod;
        int bottom = val % mod;
        *to++ = (U8)bottom;
        return EncodeMod(to,top,mod);
    }
}

int DecodeMod(U8PC & from, int mod)
{
    const int upper = 256 - mod;
    int byte = *from++;
    if ( byte >= mod )
    {
        return (byte - mod);
    }
    else
    {
        int val = DecodeMod(from,mod);
        val *= mod;
        val += byte + upper;
        return val;
    }
}

but of course you don't actually want to use recursive functions; the less obviously symmetric non-recursive implementation is :

U8 * EncodeMod(U8 * to, int val, int mod)
{
    ASSERT( mod >= 1 && mod <= 255 );
    const int upper = 256 - mod;
    for(;;)
    {
        
        if ( val < upper )
        {
            *to++ = (U8)(mod + val);
            return to;
        }
        else
        {
            // val >= upper
            val -= upper;
            int lower = val % mod;
            *to++ = (U8)lower;
            val /= mod;
        }
    }
}

int DecodeMod(U8PC & from, int mod)
{
    const int upper = 256 - mod;
    int mul = 1;
    int val = 0;
    for(;;)
    {
        int byte = *from++;
        if ( byte >= mod )
        {
            val += (byte - mod)*mul;
            break;
        }
        else
        {
            val += (byte+upper)*mul;
            mul *= mod;
        }
    }
    return val;
}

and furthermore in practice you probably don't want to eat the cost of the divide and modulo in the encoder (it's only a mul in the decoder, but still). So you can require that mod is a power of 2. The implementation is obvious :

U8 * EncodeModPow2(U8 * to, int val, int bits)
{
    ASSERT( bits >= 0 && bits < 8 );
    const int mod = (1<<bits);
    const int upper = 256 - mod;
    for(;;)
    {
        
        if ( val < upper )
        {
            *to++ = (U8)(mod + val);
            return to;
        }
        else
        {
            // val >= upper
            val -= upper;
            int lower = val & (mod-1);
            *to++ = (U8)lower;
            val >>= bits;
        }
    }
}

int DecodeModPow2(U8PC & from, int bits)
{
    const int mod = (1<<bits);
    const int upper = 256 - mod;
    int shift = 0;
    int val = 0;
    for(;;)
    {
        int byte = *from++;
        if ( byte >= mod )
        {
            val += (byte - mod)<<shift;
            return val;
        }
        else
        {
            val += (byte+upper)<<shift;
            shift += bits;
        }
    }
}

Now in all this code we're outputting a byte at a time, but of course you could start with 2 bytes, or go 1-1-2 or whatever.

EncodeMod is a superset of some of the previous encodings we've looked at.

For example, mod = 1 is a 1-1-1-1 flag value encoding. That is, we reserve only one value in the range to indicate more bytes follow.

mod = 128 is a unary flag-bit encoding (the 1 bit flag + 7 bit payload scheme).

(mod over 128 is valid but a weird thing to do (*!*). mod = 0 is send a byte. mod = 256 is send one byte from a multi-byte word; eg. to send a uint16 in 16 bits is encodemod 256 then encodemod 0 ).

In between EncodeMod can generate intermediate encodings. Here is a sampling of the values at which EncodeMod steps up to using the next # of bytes :


mod   1 : 255,510,765,1020,1275,1530,1785,2040,2295,
mod   2 : 254,762,1778,3810,7874,16002,32258,64770,129794,
mod   3 : 253,1012,3289,10120,30613,92092,276529,
mod   5 : 251,1506,7781,39156,196031,
mod   8 : 248,2232,18104,145080,
mod  13 : 243,3402,44469,578340,
mod  21 : 235,5170,108805,
mod  34 : 222,7770,264402,
mod  55 : 201,11256,619281,
mod  89 : 167,15030,1337837,
mod 144 : 112,16240,2338672,
mod 233 : 23,5382,1254029,

bits  0 : 255,510,765,1020,1275,1530,1785,2040,2295,
bits  1 : 254,762,1778,3810,7874,16002,32258,64770,129794,
bits  2 : 252,1260,5292,21420,85932,343980,
bits  3 : 248,2232,18104,145080,
bits  4 : 240,4080,65520,1048560,
bits  5 : 224,7392,236768,
bits  6 : 192,12480,798912,
bits  7 : 128,16512,2113664,

eg. mod = 1 (the old 1-1-1-1 encoding) sends values < 255 in 1 byte, < 510 in 2 bytes, etc. So for example at mod 13 we can only get up to 242 in 1 byte, but we can get a lot more in 2 bytes (< 3402).
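The threshold tables can be cross-checked by just counting encoder output; this sketch (EncodeModNumBytes is a hypothetical helper) mirrors the non-recursive encoder's loop without writing any bytes :

```cpp
#include <stdint.h>

// Count the bytes EncodeMod would emit for (val, mod);
// mirrors the loop of the non-recursive encoder above.
int EncodeModNumBytes(uint32_t val, int mod)
{
    const uint32_t upper = 256 - (uint32_t)mod;
    int n = 1;
    while ( val >= upper )     // doesn't fit yet : emit one byte, carry the rest
    {
        val = (val - upper) / (uint32_t)mod;
        n++;
    }
    return n;
}
```

For mod 13 this gives 1 byte up to 242 and 2 bytes up to 3401, matching the table.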

EncodeMod is flexible but it cannot generate all possible variable length integer encodings in bytes. As a counterexample, if we try to reproduce the 2-flag-bits thresholds (0,0x40,0x4040,0x404040) the closest I can get with EncodeMod is to do mod 192, then 170, then 127, which gives thresholds of (0,0x40,0x40C0,0x408040). One of the differences here is that EncodeMod has no upper limit, it can encode infinite values (just like flag-value and unary-flag-bits encodings), but the fixed-number-of-flag bit (or finite length prefix code) type encodings do have an upper limit, so are not reserving any space for "there are still bigger values".

Finally, some charts showing the number of bytes required to send a value with various mods. The charts all show the same data, just at different zooms of the X axis. You should be able to see that low mod is best for small values, high mod is best for large values, and they cross somewhere in between. (the rightmost point is not the next byte step, it's just where the data is clipped at the chart boundary; all the other points indicate a step up in the number of bytes output).

ADDENDUM : *!* : actually encodemod with a mod over the midpoint is totally useful and reasonable.

09-02-12 - Encoding Values in Bytes Part 2

A short post for a small concept : encoding using division of words into ranges.

If you're trying to put two different things in a byte (eg. in Part 1 we were putting either "the value fits in this byte and here it is" or "the value does not fit in this byte and here's a portion of it"), you can obviously use a flag bit.

eg. say you're trying to send either some number of apples (A's) or some number of bananas (B's) in a byte. You could send :


0 bit + 7 bits more = 7 bits of A's
1 bit + 7 bits more = 7 bits of B's

and you can write the decoder in a bit-twiddling manner :
{

    int byte = *ptr++;
    int isB   = byte & 0x80;
    int count = byte & 0x7F;

}
but you can also write the same decoder in a value-checking manner :
{
    int byte = *ptr++;

    if ( byte >= 128 )
    {
        isB = true;
        count = byte - 128;
    }
    else
    {
        isB = false;
        count = byte;
    }

}
(of course you should follow with a count++ in all cases unless you want to be able to send a count of 0)

But writing it in a value-checking manner makes it obvious that we don't need to put the divider in the middle. We could just as easily do :

{
    int byte = *ptr++;

    if ( byte >= 90 )
    {
        isB = true;
        count = byte - 90;
    }
    else
    {
        isB = false;
        count = byte;
    }

}
and now we have [0-89] for A's and [90-255] for B's ; this distribution may be better if your data is not symmetric.

To first approximation, the division point should be the probability of A's vs. B's ; eg. midpoint = P(A)*256. (this is not exactly right; it depends on the distribution of magnitudes of A's and B's ; plus there are some ugly quantization boundary effects, but it is roughly right).

Of course flag-value encodings that we saw last time are a special case of this with the divider shoved all the way to one end.

{
    int byte = *ptr++;

    if ( byte >= 255 )
    {
        isB = true;
        count = byte - 255;
    }
    else
    {
        isB = false;
        count = byte;
    }

}
and a "B" is something that doesn't fit in one byte.

Now this is obviously a kind of proto-arithmetic-coder. It doesn't carry left over range from one byte to the next, but it's trying to divide the output range of the current byte based on the probability of the symbol.

In particular for one simple case, it does exactly what you expect :

Say you have a random source of 0's and 1's (eg. bits) with some probability P0 of 0 (and 1-P0 of 1). You decide to encode them using RLE, that is you'll send a run length of either 0's or 1's. You want to put the run length in a byte (or nibble, or whatever). You can do it by sending a word which is broken into two ranges; one range for a run of 0's and another range for a run of 1's. Thusly :


// output (a U8 *) and base (eg. 256 for bytes) are globals here for brevity
void EncodeRun( bool isRun0 , int runLen , int divider )
{
    ASSERT( runLen > 0 );

    if ( isRun0 )
    {
        // I get [0,divider-1] in the output

        for(;;)
        {
            if ( runLen <= divider )
            {
                *output++ = runLen-1;
                return;
            }
            else
            {
                *output++ = divider-1; // decodes as a run of divider 0's
                runLen -= divider;
                ASSERT( runLen >= 1 );
            }
        }
    }
    else
    {
        // I get [divider,base-1] in the output
        //  (eg. base = 256 for bytes, 16 for nibbles, etc.)

        int range = base-divider;

        for(;;)
        {
            if ( runLen <= range )
            {
                *output++ = divider + runLen-1;
                return;
            }
            else
            {
                *output++ = base-1; // divider + (range-1) ; decodes as a run of range 1's
                runLen -= range;
                ASSERT( runLen >= 1 );
            }
        }
    }
}

Note there's no flag for "doesn't fit in one byte" here; if we have more 0's than fit in one run, we just send another run of 0's following the first. eg. runs of 0's and 1's don't strictly alternate in this encoding.
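For concreteness, a decoder consistent with that no-flag convention might look like this (a sketch; the function and its counting interface are hypothetical, not from the original) : each symbol is independently a run, value v in [0,divider-1] is (v+1) zeros, value v in [divider,base-1] is (v-divider+1) ones, and runs of the same bit may simply repeat.

```cpp
#include <stdint.h>

// Decode a buffer of run symbols into total counts of 0's and 1's.
// Each symbol stands alone; no continuation flags.
void DecodeRuns(const uint8_t * in, int numSymbols, int divider,
                int * numZeros, int * numOnes)
{
    *numZeros = 0; *numOnes = 0;
    for (int i=0;i<numSymbols;i++)
    {
        int v = in[i];
        if ( v < divider ) *numZeros += v + 1;            // run of (v+1) zeros
        else               *numOnes  += v - divider + 1;  // run of (v-divider+1) ones
    }
}
```

eg. with divider = 100, the symbol stream {99, 99, 99} is three runs of 100 zeros each.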

Anyway, this encoder has just been a very concrete demonstration to get us to this statement :

For a random source with probability P0 of a 0, the optimal value of "divider" is divider = P0 * base. That is, you split the range proportional to the probability of each type of value.

(it may be either the floor or ceil of that float). (and it may not be true very close to the edges, eg. at divider=1 or base-2).

This was just an exercise to prove that our intuition is correct - the divider corresponds to the probability of each type of thing.

I'm sure that this technique has been well known to practitioners of the art forever. Back when LZRW was written, the tiniest last ounce of efficiency was needed, so the bit-packing approaches were preferred. With better CPU's now we can afford to do a branch or table lookup, which lets us put the divider away from the middle. So far as I know this technique was first used in a mainstream product in LZO (*). Recently Sean helped clarify my thinking about it.

In the next part we will bring together parts 1 and 2.

* = ADDENDUM : LZO uses this scheme to send a match or literal(s) in a byte. I don't recall the exact scheme in LZO, but say for example you wanted to send either a literal run len or a match length in an LZ encoder in a 4 bit nibble. You could send 1 bit for a match/literal flag, and then 3 bits of match len or literal run len. Instead, we now see that's equal to a threshold of 8, and we can put that threshold somewhere else to minimize our output. eg. it could be 10, then you have [0,9] for matches and [10,15] for literals.

In LZO the threshold is a constant, presumably tweaked over some test set, but other ideas are obvious. The threshold could be transmitted in the header of the file and the encoder could optimize it for the file in question. The threshold could also be dynamic, then you have an adaptive bytewise entropy coder of sorts. The adaptation schemes are obvious because as we have shown the threshold acts very much like a probability, so you can use the standard schemes that are known for binary arithmetic coders (eg. threshold += (threshold>>5) type of stuff, or rung-ladder, etc). (rung-ladder is a table-driven state machine, eg. threshold = table[state].threshold ; state = table[state].after_match and then optimize the tables somehow).
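As a sketch of the adaptive idea (hypothetical code, not LZO's) : keep the probability in higher precision than the symbol range, so the threshold += (threshold>>5) style update doesn't stall on small bases like nibbles, and derive the divider from it :

```cpp
// Adaptive divider for a bytewise/nibblewise range split.
// p12 tracks P(match) in 12-bit fixed point; the divider is derived
// from it, so the update granularity doesn't depend on the base.
struct AdaptiveDivider
{
    int p12;  // P(match) in [0,4096)
    int base; // symbol range : 16 for nibbles, 256 for bytes

    int Divider() const
    {
        int d = (p12 * base) >> 12;
        if ( d < 1 ) d = 1;           // always leave room for both events
        if ( d > base-1 ) d = base-1;
        return d;
    }
    void UpdateMatch()   { p12 += (4096 - p12) >> 5; } // saw a match : more range for matches
    void UpdateLiteral() { p12 -= p12 >> 5; }          // saw a literal : less
};
```

With base 16 and p12 starting at 2048 the divider starts at 8 (the naive 1-bit split) and drifts toward whichever event is more common.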

If I recall correctly, LZO also does some semi-arithmetic-codery stuff, in the sense that if not all of the range is needed to flag what you are currently sending, that value range can be carried over into a later encoding. Again, this is not the exact LZO scheme but you could imagine a decoder that works like this : grab a nibble; a value in [0-11] is a match (with perhaps 1 bit of offset and the rest is match lengths), a value in [12-15] flags a literal; now obviously you don't need 4 values to send a single bit flag, so we keep (nibble-12) and add it to the next match length we decode.

ADDENDUM 2 : just to emphasize what I'm saying in the last paragraph because I think it's important.

When designing heuristic / bit-packing compression codes, you should *not* take the view point of "hey a literal flag takes 1 bit so I will allocate one bit of space for it".

What you should do is find the probability of events; (eg. literal vs. match); and allocate space in the code based on probability. Then, once you have done that, figure out a way to use the values in the code productively.

09-02-12 - Encoding Values in Bytes Part 1

A very basic level of data compression is just sending values in bytes. I'm gonna do a few posts on this topic, first some review.

A common case is you want to send some value which has a large max (or perhaps infinite or unknown max) but has a much lower mean. Obviously you don't want to just output all the bits/bytes needed for the max value all the time.

(to be quite concrete : one standard situation is transmitting a file size. A file size could take up to 64 bits but most file sizes fit in 16 bits. Sending 64 bits all the time is silly; you want some kind of variable length encoding that uses 2 bytes when possible, then more if necessary. Of course because of the Kraft inequality you will no longer be able to send a value of (2^64 - 1) in 8 bytes; because you are compressing some of the values you must expand some)

Let's run through the standard ways to do this :

Flag Value Encodings

These reserve a flag value in the range to mean "more bytes follow". The most basic form is like :

{
    U8 *ptr; int value; // given

    while( value >= 255 )
    {
        value -= 255;
        *ptr++ = 255;
    }
    *ptr++ = value;
}
That is, we reserve one value in the byte (255) to be a flag meaning "more bytes follow". Values lower than 255 can be sent right away. If more bytes follow, we try again to just send it in one byte.

There are various other options for flag-value encodings. I like to name them by the number of bytes sent in each step. So the example above is "1-1-1-1-..." encoding. (it's the "unary" of byte encoding; that is, 1-1-1-1 encoding sends small values in the fewest possible bytes, but it sends large values in the most possible bytes (of any prefix code)).

But there are other options : 1-2-4-8 (try to send just 1 byte, then try to send 2 bytes (flag value 65535), then try in 4), 1-1-2-3-4-5 , etc. (and you don't have to start with 1; for file sizes you might use 2-3-5-8 (though really for file sizes a flag bit encoding is probably better)).

In fact while the simplicity of 1-1-1-1 encoding is appealing, it's almost never the best way to go, because in the rare chance that you have some degenerate data with a very large value, it can expand a lot. (eg. to send 100,000 in 1-1-1-... encoding takes 393 bytes). Even if 1-1-1- is theoretically optimal on your data, you should use 1-1-1-2-4 or something like that to limit the maximum output in bad cases.

Here's the number of bytes sent for some flag value encodings :


  value :   1-1-1-1   1-1-2-3   1-2-3-4
    200 :         1         1         1
    400 :         2         2         3
    600 :         3         4         3
   1000 :         4         4         3
   1600 :         7         4         3
   2600 :        11         4         3
   4200 :        17         4         3
   6800 :        27         4         3
  11000 :        44         4         3
  17800 :        70         4         3
  28800 :       113         4         3
  46600 :       183         4         3
  75400 :       296         7         6
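The table entries can be generated with a small helper (hypothetical, not from the original) : each step of s bytes terminally encodes values below its all-ones flag value, otherwise emits the flag, subtracts it, and moves to the next step; the last step repeats when the value still doesn't fit.

```cpp
#include <stdint.h>

// Count bytes used by a flag-value encoding with the given step sizes
// (in bytes), eg. {1,1,2,3}; the last step repeats if needed.
int FlagValueNumBytes(uint64_t value, const int * steps, int numSteps)
{
    for (int total = 0, i = 0;;)
    {
        int s = steps[i];
        uint64_t flag = (s >= 8) ? ~0ull : ((1ull << (8*s)) - 1); // all-ones flag value
        total += s;
        if ( value < flag ) return total; // fits : done
        value -= flag;                    // emit the flag, keep going
        if ( i < numSteps-1 ) i++;        // last step repeats
    }
}
```

This reproduces the table above, eg. 393 bytes for 100,000 under 1-1-1-1.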

Flag Bit Encodings

The next very common method is a flag bit (or bits) to indicate more bytes follow. This lets you send large values in fewer bytes at the cost of sending small values in more bytes.

The flag bits can be sent with any prefix code (eg. Huffman, unary, etc) of course. Some common simple cases :

One flag bit at a time

(aka unary flag bits, or 7-bit words). You do this encoding thusly :

{
    U8 *ptr; int value; // given

    while( value >= 128 )
    {
        value -= 128;
        *ptr++ = 0x80 | (value & 0x7F); // set the "more bytes follow" flag bit
        value >>= 7;
    }
    *ptr++ = value;
}
To decode it, you check if the top bit of each byte is on. The top bit is being on means "more bytes follow". Whether or not the top bit is on, you always have 7 bits of payload to add to your value. That is, the encoding is like :

0     - 127      :  0 + 7 bits
128   - 16511    :  1 + 7 bits ; 0 + 7 bits    [or 10 + 14 bits]
16512 - 0x20407F :  1 + 7, 1 + 7, 0 + 7        [or 110 + 21 bits]
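Writing the decoder out from that description (a sketch; it assumes continuation bytes have their top bit set, and that each continuation byte also raises the base by 128 at its scale, matching the value -= 128 in the encoder) :

```cpp
#include <stdint.h>

// Decode the unary-flag-bit (7 bits per byte) encoding above.
// Advances the pointer past the bytes consumed.
int DecodeFlagBits7(const uint8_t *& ptr)
{
    int val = 0, shift = 0;
    for (;;)
    {
        int b = *ptr++;
        val += (b & 0x7F) << shift; // 7 payload bits per byte
        if ( !(b & 0x80) )
            return val;             // top bit clear : last byte
        val += 128 << shift;        // each continuation raises the base at this scale
        shift += 7;
    }
}
```

eg. the bytes {0xFF, 0x7F} decode to 16511, the top of the 2-byte range above.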

Another common case is :

N flag bits

In this case you just send a fixed number of flag bits; eg. 2 bits selects 1-4 bytes. The encoding is :
0        - 63      : 00 + 6 bits
0x40     - 16447   : 01 + 6+8 bits
0x4040   - 4210751 : 10 + 6+8+8 bits
0x404040 - 1.078e9 : 11 + 6+8+8+8 bits

Now of course you can combine the two, and you can use the same kind of nomenclature as before, but this time referring to the number of flag bits ; eg. 1 flag bit, then 1 more, then 2, then 3. But as noted previously this is really just a prefix code for the number of bits.

Flag bit and flag value are really just two extremes of a related continuum, which we shall see in a later post.

9/01/2012

09-01-12 - Good Computer Days

I was throwing out some old papers and found this :

It was really nice of them to include "/loc" (perhaps by accident?).

Man I miss the days when I was excited to explore a virtual world, and felt like nobody but me and the developer had seen it before, and I drew my own maps and wrote down notes of stuff I wanted to come back to later.

I feel like the internet has ruined that. Sure you could choose not to look on the net, but that's like choosing to enter a boxing ring with a blindfold on; everyone else has a massive massive advantage over you. Even in single player games it feels dumb to me; it's like people who hike up mountains when there's a tram or road up it; for me it totally ruins the hike if I get to the top and find a parking lot full of people who got there the easy way.

One of my favorites was the Amiga game "Dungeon Master" where you made spells by putting together these elemental symbols. You would find scrolls and clues in the game with combos that worked, but you could also sort of deduce them (they were semi-logical), and when you figured one out it was like an awesome eureka moment, and you wrote it down in your little scratch pad. Nowadays that kind of system can't even be in games at all because everyone would just look up all the combos right away (some dumb devs do still try to use this kind of system, but don't let you use a combo until you have unlocked it (by purchase or level up or whatever), which ruins all the joy from it and makes it quite pointless).

09-01-12 - LZP1.h

In my continuing STB-ification of old code, I did a single file header version of the original "LZP1" (this is "LZP1b" in the nomenclature of this recent LZP1 post ).

It's here : lzp1.h

As noted previously, this is no longer a good way to do fast data compression; it's too branchy, it doesn't take advantage of the large amount of RAM available, etc. This is mainly a historical curiosity.

This header was just ripped out of the LZP code that has long been available here, at cbloom.com .

One thing that has really struck me, being away from compression for a while and then coming back to it, is that so many of the new ideas are just old ideas that weren't practical at the time. When you're working on something, you have tons of ideas that you throw out because they take too much memory, or they're too slow, or whatever. But in 5-10 years those ideas will be good, and 5-10 years is not really that far away. If we were dick-heads we could have patented every idea that we ruled out because it was "not practical" at the moment.

(that's not to disparage the new ideas; what most people who are not practitioners of the dark arts don't realize is that the valuable contribution is almost always the tiny details that make something work; the general idea was probably obvious to researchers 20 years ago, they just didn't pursue it for whatever reason)

8/19/2012

08-19-12 - Packages in Standard C

So we've talked about DLL's a few times and how fucked up they are. What you really want is something like a DLL that you can statically put in your app so it's not a separate file. We'll call this a "package". But I was reminded the other day that C libs are also fucked up.

Last week at RAD we discovered that some of our Xenon libs were several megabytes bigger than they need to be simply because we included "xtl.h" in a few files. What was happening was that xtl.h has a ton of inline functions in it, and the compiler goes ahead and compiles all of them and sticks them in your OBJ even if you don't use them (of course this is another problem with C that I'd like to see fixed - there's no need to waste all that time compiling functions that I don't use - but that's another rant).

Of course when you make a lib, all it does is cram together your obj's. It doesn't strip the uncalled functions (that's left to the linker, later on).

So DLL's are fucked and we wanted them to be "packages" ; and libs are also fucked and we want them to be packages too!

What I think a "package" in C should be :

  • 1. You provide an explicit list of exports (perhaps by adding an __export decorator). Only exported symbols are accessible from outside the package. (you may also have some explicit imports if you are not a leaf package).

  • 2. All internal references in the package are resolved at package build time. This lets you find "link errors" without having to make an exe that pulls in every obj in the lib to test for missing references.

    This also lets you strip all un-referenced and un-exported symbols.

  • 3. LTCG or anything funny like that which is delayed to the link stage could be done on the package.

  • 4. Symbols of the same name that are in different packages do not conflict. This is such a huge disaster in C and a source of very unpredictable and hard to deal with bugs. Just because some lib used a global "int x" and my lib uses a global "int x" doesn't mean I want them to be the same variable.

  • 5. If your package uses some libs, they should (optionally) be linked into the package and made "private" not exported; that way multiple packages that use different variants of libs can be put together in one app without conflict.

So that's all background material. What this post is about is this : it occurs to me that you can get most of this in standard C by making your own "libpackager" tool.

libpackager should take a lib and output a lib. You have to also provide it a list of exports (or you could use some decorator that it can parse to mark the exports). It can parse the obj's in the lib and find all the symbols and do its own "link" step to eliminate unreferenced symbols, then remake the obj's without those symbols. So this gives us #2, which is a pretty big win.

You could also do #4 by having libpackager decorate all the internal symbol names that are neither import nor export. This is roughly equivalent to if you had put all your internal symbols in some namespace.

You could even do #5 ; make libpackage go and grab the libs you reference, stuff their obj's into your lib also. Then your copy of the lib and your references get name-decorated so they don't conflict with someone else. eg. say you want to make "oodle.lib" as a package and you use "radmemset" from "radutil.lib" , packager could grab radutil.lib and stuff it in; then since radmemset is now an internal reference, it gets changed to "oodlelib_radmemset". Now when you put "oodle.lib" and "bink.lib" both into your app, if they used different versions of radmemset, they will not cross-link because the libs have been made into fake "packages". (this step should be optional because sometimes you do want cross-links).

One annoying complication is that this doesn't work with the stdlib in a straightforward way. I would very much like to be able to "package" all references to stdlib in this way, but stdlib is not just a normal lib, it also has some special cheating connections to the crt0 startup code, so you can't just go and rename all its symbols to oodlelib_memset and such. Perhaps this could be resolved, which would be nice to avoid all those garbage problems that arise because some lib was built for libc and some other lib was built for libcmtd , etc.

I think this all is pretty straightforward (other than the stdlib issues). The only hard part is parsing the lib and obj formats on every platform and build variant that you need to support.

(BTW a bit of web searching indicates that the gcc tools on some platforms (Mac) provide some of this; there seems to be some special attributes for exports from libs and perhaps a lib tool that does dead strips; it's hard to follow gcc docs)

8/17/2012

08-17-12 - Defines

In the category of "stuff everyone should know by now" : doing "#if" is much better than "#ifdef" for boolean toggles that you want to be able to set from the compiler command line.

The "ifdef way" is like :


code :

#ifdef STUFF
  .. stuff a ..
#else
  .. stuff b ..
#endif

command line :

compiler -DSTUFF %*
or
compiler %*

Whereas the "if way" is :

code :

#ifndef STUFF
  // stuff not set
  // could #error here
  // or #define STUFF to 0 or 1
#endif

#if STUFF
  .. stuff a ..
#else
  .. stuff b ..
#endif

command line :

compiler -DSTUFF=1 %*
or
compiler -DSTUFF=0 %*

Why is the "if way" so much better ?

1. You can tell if the user set STUFF or not. In the ifdef way, not setting it is one of the boolean values, so you can't tell if the user made any intentional selection or not. Sometimes you want to ensure that something was selected explicitly because it's too dangerous to fall back to a default automatically.

2. You can easily change the default value when STUFF is not set. You can just do #ifndef STUFF #define STUFF 0 or #ifndef STUFF #define STUFF 1. To change the default with the ifdef way, you have to change the sense of the boolean (eg. instead of STUFF use NOTSTUFF) and then all your builds break because they are setting STUFF instead of NOTSTUFF (and that breakage is totally fragile and non-detectable because of point #1).

3. There's no way to positively say "not STUFF" in the ifdef way. The way "not STUFF" is set is by not passing anything on the command line, but frequently it's hard to track down exactly how the command line is being set through the convoluted machinations of the IDE or make system. If some other bad part of the build script has put a -DSTUFF on your command line, you can't easily undo that by just tacking something else on the end of the command line.

I think it's incontrovertible that the "if way" is just massively better, and everyone should use it all the time, and never use ifdef. And yet I myself still use ifdef frequently. I'm not really sure why, I think it's just because I grew up using ifdef for toggles, and I'm so used to seeing it in other people's code that it just comes out of my fingers naturally.


Anyway, I was thinking about this because I had some problems with some #defines at RAD, and I chased down the problem and cleaned it up, and it seemed to me that it was a pretty good example of "cbloom style robustination". I've never met anyone who writes code quite like me (some are thankful for that, I know); I try to write code that is hard to use wrong (but without adding crazy complexity or overhead the way Herb Sutter style code does).

(disclaimer : this is not intended as a passive aggressive back-handed way of calling out some RAD coder; the RAD code in question is totally standard style that you would see anywhere, and it wasn't broken, just hard for me to use)

Anyhoo, the code in question set up the function exporting for Oodle.h ; it was controlled by two #defines :

#ifdef MAKEDLL
    #define expfunc __declspec(dllexport)
#else
#ifdef MAKEORIMPORTLIB
    #define expfunc extern
#else
    #define expfunc __declspec(dllimport)
#endif
#endif
Okay, so there are four usage cases :
1. building Oodle as a LIB - use -DMAKEORIMPORTLIB
2. building Oodle as a DLL - use -DMAKEDLL
3. building an app that uses Oodle as a LIB - use -DMAKEORIMPORTLIB
4. building an app that uses Oodle as a DLL - use no define
and that all works fine (*). But I found it hard to use; for example if I try to stick a -DMAKEXE on the command line and somebody already set -DMAKEDLL, it doesn't do what I expected; and there's no way to definitely say "I want dllimport".

(* = actually it also works if you use -DMAKEORIMPORTLIB in case 4; specifying "dllimport" for functions is actually optional and only used by the compiler as an optimization)

So anyway here's the robustinated version :


    #ifdef MAKEDLL
        #define expfunc __declspec(dllexport)
        
        #if defined(MAKELIB) || defined(IMPORTLIB) || defined(IMPORTDLL)
            #error multiple MAKE or IMPORT defines
        #endif
    #elif defined(IMPORTDLL)
        #define expfunc __declspec(dllimport)
        
        #if defined(MAKELIB) || defined(MAKEDLL) || defined(IMPORTLIB)
            #error multiple MAKE or IMPORT defines
        #endif
    #elif defined(MAKELIB)
        #define expfunc extern
        
        #if defined(MAKEDLL) || defined(IMPORTLIB) || defined(IMPORTDLL)
            #error multiple MAKE or IMPORT defines
        #endif
    #elif defined(IMPORTLIB)
        #define expfunc extern
        
        #if defined(MAKELIB) || defined(MAKEDLL) || defined(IMPORTDLL)
            #error multiple MAKE or IMPORT defines
        #endif
    #else
        #error  no Oodle usage define set
    #endif

and usage is obvious because there's a specific define for each case :
1. building Oodle as a LIB - use -DMAKELIB
2. building Oodle as a DLL - use -DMAKEDLL
3. building an app that uses Oodle as a LIB - use -DIMPORTLIB
4. building an app that uses Oodle as a DLL - use -DIMPORTDLL
and it's much harder to use incorrectly, because you have to set one and only one. Also it's a little bit less implementation tied, in the sense that the fact that MAKELIB and IMPORTLIB are actually the same thing is hidden from the user in case that ever changes.

(and of course I instinctively used #ifdef for toggles when I wrote this instead of using #if)


I used to think that "robustinated" code was the One True Way to write code, and I wrote advocacy articles about it and tried to educate others and so on. I basically have given up on that because it's too frustrating and tiring trying to convince people about coding practices. And in my old age I'm more humble and no longer so sure that it is better (because the code becomes longer, and short to-the-point code has inherent advantages; also robustination takes coder time which could be spent on other things; lastly robustination also tends to make compiles slower which hurts rapid iteration).

But I do know it's the right way for *me* to write code. When I first came to RAD I tried very hard to write code the "RAD way" so that the style would be consistent and so on. That was a huge mistake, it was very painful for me and made me write very bad code and take much longer than I should have. Only after a few years in did I realize that to be productive I have to write code my way. In particular I need the code to be very strongly self-checking.

8/12/2012

08-12-12 - Unicode on Windows Summary Page

Making another summary page for myself to link to.

Posts about the disaster of Unicode on Windows : (mainly with respect to old apps and/or console apps)

cbloom rants 06-14-08 - 3
cbloom rants 06-15-08 - 2
cbloom rants 06-21-08 - 3
cbloom rants 11-06-09 - IsSameFile
cbloom rants 06-07-10 - Unicode CMD Code Page Checkup
cbloom rants 10-11-10 - DeUnicode v1.0
cbloom rants 10-11-10 - Windows 1252 to ASCII best fit
cbloom rants 07-28-12 - DeUnicode 1.1

Brief summary : correctly handling unicode (*) file names in a console app on windows is almost impossible. cblib has some functions to do the best I believe you can do (MakeUnicodeNameFullMatch), but it's so complicated and error prone that I suggest you should not try it. Also never use printf with wchars, it's badly broken; do your own conversion.

(* = actually the problem occurs even for non-unicode 8-bit character names (eg. any time the "A" "OEM" and "ConsoleCP" encodings could be different); Windows console apps only work reliably on file names that are 7-bit ascii).

8/11/2012

08-11-12 - Technical Writing

Whenever I give people my technical writing to review, one of the first comments out of most people's mouths is "you need to remove the use of 'I' , and the asides, and the run-on sentences, and this bit where you say 'fuck' is unprofessional, and blah blah".

Fie! Fie I say to you!

One of the great tragedies of modern technical writing is that it has gotten so fucking standard and boring. There is absolutely no reason for it. It does not make it clearer or easier to read, in fact it makes it worse in every way - less clear, less fun, less human.

If you read actual great technical writing, it has humanity and humor. For me the absolute giants of technical writing are Feynman and Einstein. There's lots of cleverness and little winks for the advanced reader and lots of non-standard ways of writing things. If they followed Boring Technical Style Guide it would suck all the personality and beauty from their writing. (I also like Isaac Asimov's technical writing and John Baez's).

I think computer writing has become particularly bad in the last 10 years or so. The books are all Microsoft-press-style bullet point garbage. Blogs (eg. finger files) started out in the early days as sort of wonderful ramshackle things where each one was different and reflected the writer's personality, but recently there has developed this standard "technical blog style" that everyone follows.

Standard Technical Blog Style is very pedantic and condescending; the author acts like some expert from on high (regardless of their actual expertise level). There are as many self-plugs as possible. I find it vomitacious.

A while ago someone wrote a blog series about floating point stuff; it really bothered me for various reasons. One was that the topic has been covered many times in the past (by Chris Lomont for example, also FS Acton, Kahan, Hecker, etc) (if you actually want to learn about floating points, Kahan's web page is a good place to start). Another is that it just rolled out the same old crap without actually talking about solutions (like "use epsilons for floating point compares" ; wow that is super non-useful advice; tell me something real like how to make a robust BSP engine with floating point code). But maybe the most bothersome thing about it all was that it was written in Standard Boring Dicky Technical Blog Style when you can go out right now and buy a wonderful book by Forman S. Acton on floating point which is not only much much more useful, but it's also written with cleverness and humanity. (Kahan's writing is also delightfully quirky). It's kind of like taking a beautifully funky indie movie and remaking it as mainstream shlock; it's not only a waste of time, but offensive to those of us who appreciate the aesthetic pleasure that is possible in technical writing.

Anyway, if you are considering doing some blogging or technical writing, here is my advice to you :

1. Make it informal. Use I. Use incomplete sentences. Tell stories about your personal experience with the topic. When you put in some really complicated code or equations or whatever, explain what it means with colloquial, conversational english.

2. Don't look at any reference material for a writing style to copy. Their style fucking sucks. Don't copy it. If you listen to people telling you the "right way" to do things, you will be aspiring to mediocrity. (err, ahem, but do listen to me).

3. Do not use an artificial impersonal voice to add "gravity" or a false air of expertise, it doesn't work. Be humble; admit it when you aren't sure about something. Also don't pad small ideas with more text to make them seem bigger. There's nothing wrong with a one sentence idea. 90% of AltDev blogs should be one paragraph or less.

4. Do not waste time editing that could be spent making the content better. I bet you didn't actually run fair comparison tests against competing methods. Go do that instead. I will not judge you by the purpleness of your prose but rather by the content of your creation.

5. Stop writing blogs about shit that is already very well covered in books. Your writing should always be from the perspective of your domain-specific experience on a topic. Don't write yet another introduction to Quaternions, write about how you've used them differently or some application you've found that you think is worth writing about. Real domain-specific experience is what makes your writing valuable.

6. Habeas Corpus. Show me the money. If you're writing about some new technique, provide code, provide an exe, prove it. If I can't repro your results, then I don't believe you. Document the tiny details and embarrassing hacks. The vast majority of technical writers don't write up what they *actually* use. Instead they write up the idealized clean version of the algorithm that they think is more elegant and more scientific. Often the most useful thing in your work are the hacks for weird cases that didn't work right. People are usually too proud of the main idea; hey guess what, thousands of people have had that idea before, but didn't think it was worth pursuing or didn't get the details quite right; the value is usually in the tweak constants or the little fudgey bits that you figured out.

8/10/2012

08-10-12 - cbhashtable

cbhashtable is a single file standalone hash table. It is a power-of-two-size reprobing hash table (aka "open addressing" or "closed hashing") which uses special values for empty & deleted slots (not separate flags). It optionally stores the hash value in the table to accelerate finding when the key comparison is slow.

Download : cbhashtable.h at cbloom.com

cbhashtable was ripped out of cblib . I recently improved the cblib version so that the hash table entries can be {hash,key,data} or {hash,data} (key==data) or {key,data} (no stored hash) or just {data} (key==data and no stored hash). (or whatever you want I guess, though those are the only 4 that make sense I think).

cbhashtable is built on a vector to store its entries; you can use std::vector, or your own, or use cbvector .

See previous posts on hash tables :

cbloom rants 10-17-08 - 1
cbloom rants 10-19-08 - 1
cbloom rants 10-21-08 - 4
cbloom rants 10-21-08 - 5
cbloom rants 11-23-08 - Hashing & Cache Line Size
cbloom rants 11-19-10 - Hashes and Cache Tables
cbloom rants 11-29-10 - Useless hash test

Commentary :

I'm pretty happy with the implementation of cbhashtable now, but setting it up is still a bit awkward. (using it once it's set up is fine). You have to create an "ops" functor which knows how to make & detect the special empty & deleted keys. I may try to improve this some day.

7/28/2012

07-28-12 - DeUnicode 1.1

Made a few revisions to DeUnicode. Mainly because I got sick of continuing to run into stupid file name problems, you now have the option to even further reduce the set of allowable characters.

To make file names that are (almost) guaranteed to not cause any problems, you can use

DeUnicode -a -c -s *
where -a makes the output 7-bit ascii (instead of 8-bit windows "A" code page (not ANSI), which is the default), -c removes all special characters that are used in my console ("*^;[]()" etc., as well as anything under 32), and -s removes spaces.

Once you do all that, hey you actually get a fucking file name that you can use in the real world. While *some* apps work with some of those characters and *some* command shells work with some of those characters, if you have any of them, it's quite possible you will run somebody else's wild card traverse and it will do unexpected things, which is very very very (very very very) bad. Particularly when deleting or renaming.

usage :

c:\src\deunicode>deunicode -?
DeUnicode v1.1 by cbloom
usage:
 DeUnicode [-opts] <dir>
-r : recurse
-t : special the ("the blah" -> "blah, the")
-c : remove chsh special characters
-s : remove spaces
-a : ascii output [else 'A' code page]
-d : display only (don't do)
-q : quiet
-v : verbose
DeUnicode.zip at cbloom.com

7/23/2012

07-23-12 - Structs are not what you want

I was looking at some stupid struct in my code which was something like
struct Stuff
{
    U16     m_val1;
    U32     m_val2;
    U16     m_val3;
};
because of the order of variables and the standard packing, it wound up taking 12 bytes instead of 8. In this particular case, the types were actually templates or #defined types so I don't know statically that the ordering is fucked up and causing that problem.

It occured to me that the whole C "struct" thing is really not what we want most of the time. 99% of the time I want to say "put all these values in this thing, and I don't care what order". In particular, Mr. Compiler, you can optimize the order to minimize size, or for speed, or whatever.

Now that's not a big deal with just a simple struct, but more generally it's massive.

What if I could say "bag" is some values with certain types and names. Then when I combine bags, you don't duplicate things that are in both. And you can reinterpret bags to other bags as long as they have the needed values.


bag Particle
{
    Vec3    m_pos;
    ColorDW m_color;
};

bag GameObject
{
    Vec3    m_pos;
    String  m_name;
};

bag GameObject_AndColor : GameObject
{
    ColorDW m_color;
};

now a GameObject_AndColor can be passed to any function that wants a Particle because it provides all the needed fields.
alternatively :

bag GameObjectParticle : GameObject, Particle
{
};

only has one m_pos.

Obviously this doesn't work with C functions which are compiled once and get to know the offset to the data in that one compile. You want a compiler that can compile all the logic but leave the variable references unspecified until it is bound to a concrete type.

Now obviously templates sort of address this, but in a much uglier way.

Templates are worse in 2 ways. 1. they don't do any shared compilation, they leave all compilation to the last minute when they know the concrete type they get to work on; conversely bags can do 99% of the compilation up front and just need to specialize the variable addressing per usage. 2. they don't do any static type checking; that is, when you write the template you can't specify in any kind of clean way "you can use this template on anything that provides m_pos and m_color" (C++ concepts were supposed to sort of address this) ; bags provide this very nicely and let you write a template and error-check it before you even apply it to anything.

Obviously templates are more powerful, but not useable in reality because of these problems; they delay compilation too much for widespread use, it makes compilation too slow, and puts the errors in the wrong place (at the use site, not the template itself). Compilation up front is great (I think weakly typed languages are ridiculous disasters).

But in theory bags can do even more. One annoying problem we have at RAD is that factored out functions can be massively slower than macros. There are various reasons for this, but one is that if you have some variables in registers and want to call a function on them, the compiler will almost always do a ton of work to move those variables around that it doesn't need to do (either writing them back to memory or pushing them on the stack, or just moving them to other registers).

With bags in theory you could do something like :


Vec3 pos1;
int i;
ColorDW c;

... some code that sets up pos1 and c ...

bag Particle p = { pos &= pos1, color &= c } ;  

// this is not a copy
// it says that the variables I already have can be treated as a Particle bag

Particle_Move(p);

// Particle_Move generates code that acts directly on my local variables "pos1" and "c" , no copying

The win is that Particle_Move basically acts like a macro, but I get the type-safety and separate compilation error check benefits of an inline function.

Similarly, I shouldn't have to write new code every time I have a bunch of data that I want to use as SOA (structure of arrays) instead of AOS (array of structures). eg. if I have


Vec3 positions[100];
ColorDW colors[100];

I should be able to use Particle_ functions on those, because they provide the values needed to make a valid bag.

Being a bit redundant : the classic way that simple C inheritance fails happens a lot with vertex types in graphics. It goes something like this :


struct Vertex_Pos { Vec3 pos; }

// functs that only need a "pos" member act on a Vertex_Pos

struct Vertex_PosNormal : Vertex_Pos { Vec3 normal; }

// functs that need a pos and normal act on Vertex_PosNormal

struct Vertex_PosColor : Vertex_Pos { Color color; }

// functs that need a pos and color act on Vertex_PosColor

struct Vertex_PosNormalColor : ??

// oh crap, which do I do ?

struct Vertex_PosNormalColor : Vertex_PosNormal { Color color; }
struct Vertex_PosNormalColor : Vertex_PosColor { Vec3 normal; }

// either way is busted
// with bags I should be able to just do :

bag Vertex_PosNormalColor { Vec3 pos; Vec3 normal; Color color; }

// (or either of the above inheritance ways)

and then the bag Vertex_PosNormalColor can be used as a Vertex_PosNormal or Vertex_PosColor without even explicitly specifying any inheritance. (an ideal compiler would also let you specify that as a compile-time constraint on the type).

Now obviously inheritance and virtual functions solve a slightly different problem than bags. They let you act on part of a type without knowing the concrete type. Bags have to be used on the concrete type, just like templates. I guess the fundamental problem with structs (and why they're wrong for some uses) is that they are actually solving this slightly different problem.

Anyhoo.

7/19/2012

07-19-12 - Experimental Futures in Oodle

I don't know if this will ever see the light of day, but it's fucking sexy as hell so here's a sneak peek.

"futures" implemented in C++98 with Oodle :


void example_future_comp_decomp()
{
    future<OodleBufferRC> rawBuf = oodle_readfile("r:\\oodle_example_future_input");
    
    // call :
    // oodle_compress_sync( rawBuf, OodleLZ_Compressor_LZH, OodleLZ_CompressSelect_Fast );
    // but not until rawBuf is done :
    future<OodleBufferRC> compBuf = start_future<OodleBufferRC>( oodle_compress_sync, rawBuf, OodleLZ_Compressor_LZH, OodleLZ_CompressSelect_Fast );
    
    future<const char *> write = start_future<const char*>(oodle_writefile,"r:\\oodle_example_future_comp",compBuf);
        
    future<OodleBufferRC> read_compBuf = start_future<OodleBufferRC>( oodle_readfile, write ); 
    
    future<OodleBufferRC> read_decompBuf = start_future<OodleBufferRC>( oodle_decompress_sync, read_compBuf );
    
    start_future<const char *>( oodle_writefile, "r:\\oodle_example_future_decomp",read_decompBuf);
}

This creates an async chain to read a file, compress it, write it, then read it back in, decompress it, and write out the decompressed bits.

Futures can take either immediates as arguments or other futures. If they take futures as arguments, they enqueue themselves to run when their argument is ready (using the forward dependency system). Dependencies are all automatic based on function arguments; it occurs to me that this is rather like the way CPU's do scheduling for out-of-order processing.

(in contrast to idea #1 in Two Alternative Oodles , here we do not get the full async graph in advance, it's given to us as we get commands, that is, we're expected to start running things immediately when we get the command, and we don't get to know what comes next; but, just like in CPU's, the command submission normally runs slightly ahead of execution (unless our pipeline is totally empty), in which case we have a little bit of time gap)

Functions called by the future can either return values or return futures. (eg, in the code above, "oodle_readfile" could just return an OodleBufferRC directly, or it could return a future to one). If a function in a future returns a future, then the returned future replaces the original, and doesn't return to the outer scope until the chain of futures returns a non-future value. That is, this is a way of doing coroutine yields basically; when you want to yield, you instead return a future to the remaining work. (this is like the lambda-style coroutine yield that we touched on earlier). (* - see example at end)

future of course has a wait() method that blocks and returns its value. As long as you are passing futures to other futures, you never have to wait.

You can implement your own wait_all thusly :


int nop(...)
{
    return 0;
}

template <typename t_arg1,typename t_arg2,typename t_arg3,typename t_arg4>
future<int> done_when_all( t_arg1 a, t_arg2 b, t_arg3 c, t_arg4 d )
{
    return start_future<int>( nop, a,b,c,d );
}

then call

done_when_all( various futures )->wait();

A few annoying niggles due to use of old C++ :

1. I don't have lambdas so you actually have to define a function body every time you want to run something as a future.

2. I can't induce the return type of a function, so you have to explicitly specify it when you call start_future.

3. I don't have variadic templates so I have to specifically make versions of start_future<> for 0 args, 1 arg, etc. bleh. (though variadic templates are so ugly that I might choose to do it this way anyway).

Otherwise not bad. (well, that is, the client usage is not bad; like most C++ the implementation is scary as shit; also doing this type of stuff in C++ is very heavy on the mallocs (because you have to convert things into different types, and the way you do that is by new'ing something of the desired type), if you are a sane and reasonable person that should not bother you, but I know a lot of people are in fact still bothered by mallocs).

In order to automatically depend on a previous future, you need to take its return value as one of your input arguments. There's also a method to manually add dependencies on things that aren't input args. Another option is to carry over a dependency through a binding function which depends on one type and returns another, but that kind of C++ is not to my liking. (**)

To really use this kind of system nicely, you should make functions whose return value is a compound type (eg. a struct) that contains all its effects. So, for example oodle_writefile returns the name of the file written, because it modifies that object; if you had a function that modified a game object, like say an Actor *, then its return value should include that Actor *, so that you can use that to set up dependency chains. (in real code, oodle_writefile should really return a struct containing the file name and also an error code).

* : example of returning futures to continue the async job :


float test_func_5_1(int x)
{
    Sleep(1);
    return x * (2.0/3.0);
}

future<float> test_func_5_2(int x)
{
    Sleep(1);

    if ( x == 1 )
    {
        // hey in this case I can return my value immediately
        return make_future(0.6f);
    }
    else
    {
        // I need to run another async job to compute my value
        x *= 3;
        return start_future<float>(test_func_5_1,x);
    }
}


then use as :


future<float> f = start_future<float>(test_func_5_2,7);

... do other work ...

float x = f.wait();

does what it does.

This is a necessary building block, it lets you compose operations, but it's an ugly way to write coroutine-style code.

What it is good for is creating more complex functions from simpler functions, like :


future<const char *> oodle_compress_then_writefile(const char *filename, OodleBufferRC rawBuf, OodleLZ_Compressor compressor, OodleLZ_CompressSelect select )
{
    OodleBufferRC compBuf = oodle_compress_sync( rawBuf, compressor, select );

    return start_future<const char*>( oodle_writefile, filename, compBuf );
}

I believe this "future" is much better than the real C++0x std::future, which seems to be missing a lot of features.

** : example of using a binding function to carry over dependencies :


// say I want to run two asyncs :

future<int> f1 = start_future<int>( func1 );

future<float> f2 = start_future<float>( func2 , 7.5f );

// but I want to run func2 after func1
//  due to some dependency that isn't through the return value

// what I can use is a return-to-arg adapter like :

template<typename t1,typename t2>
t1 return1(t1 a,t2 b)
{
    b; // touch b just to consume it; it exists only to carry the dependency
    return a;
}

template<typename t1,typename t2>
future<t1> return1_after2(t1 a,future<t2> b)
{
    return start_future<t1>( return1<t1,t2>, a, b );
}


// then run :


future<int> f1 = start_future<int>( func1 );

future<float> f2 = start_future<float>( func2 , return1_after2(7.5f,f1) );

but like I said previously I hate that kind of crap for the most part. Much better is to use the explicit dependency mechanism, like :

future<int> f1 = start_future<int>( func1 );

future<float> f2 = make_future<float>( func2 , 7.5f );
f2->add_dep(f1);
f2->start();

There is one case where the funny binding mechanism can be used elegantly; that's when you can associate the binding with the actual reason for the dependency. That is, if we require func2 to run after func1, there must be some shared variable that is causing that ordering requirement. Using a binder to associate func1 with that shared variable is a clean way of saying "you can read this var after func1 is done".

7/15/2012

07-15-12 - VideoPlayer and Commonopoly

I wrote this little videoplayer for RAD a while ago. The main purpose of it is not for consumer-style use but for developer diagnosing of video compressors.

In particular, you can give it a list of several videos on the command line, and it plays all of them. It plays them frame-exact by stepping through their frames one by one. Then you can pause and single step forward and back, and you can do split-screen or frame-toggles.

(I've always found that comparing image or video compressors is very misleading if you just look at the output; you have to look at them side-by-side with the originals, or even better do full-frame ping-ponging. For example, x264 generally looks great, but when you do full-frame ping-pongs you can see that they really fuck up the overall luma level and completely change the colors quite a bit (this is largely due to the very shitty YUV space that is standard for video, not anything in x264)).

Anyhoo, videoplayer does some nice things like prefetch and cache lots of frames (it's odd that the mainstream video players don't do this; I have gigs of ram you fuckers, prefetch some god damn video so that I can play over a network without hitching; also keep some previous frames around so I can pause and step backwards a little without seeking!).

download : videoplayer.zip (463k at cbloom.com)

(videoplayer needs radutil.dll which is in RAD Video Tools ; I put it on delay-load so you only need if you want to load a real video (not just images))

Anyhoo, I haven't touched it in a while cuz I'm off video for now. But "videoplayer" can also be pointed at a dir and it treats the images in the dir as frames of a video. I did this originally to load frame-dumps of videos that I couldn't play. For example I would use MPlayer to spit out frames with :

mplayer -benchmark -nosound -vo png:z=6 %1
md %1_frames
call mov *.png %1_frames\
and then point my videoplayer at the dir, and since it can prefetch and all that it can play a video of pngs (which load too slow to play without preloading a bunch).

But I realized the other day that I could use dir-loading to just show image slideshows too, if I use the option to manually set the frame rate to something really low like 0.1 fps. Which brings us around to the title of this post :

I discovered COMMONOPOLY a while ago; I think it's super beautiful ; the guy chooses images well and there's something about that horizontal reflection that really does some magic on the brain. If you full-screen it and stare at the middle and let your eyes defocus a bit, you can get the same image going in each eye and it's like mmm yeah good.

But obviously viewing it on the web sucks balls. So what you do is download all the images with DownloadThemAll, put them in a dir and then point videoplayer at the dir thusly :

videoplayerR.exe -f0.125 -w1 -q -0 -2 -4 -s1 -r t:\commonopoly
and then enjoy.

07-15-12 - libpng DLL Delay Load

cblib for a long time has used libpng , which creates a damn DLL dependency. Annoyingly, that happens at app startup, which means even if you are trying to load a .BMP it will fail to start the app if it can't find the PNG DLL (or, you know, using cblib on totally non-image related stuff). What I've wanted for a long time is to put the PNG DLL load into the PNG code so that it's only done if you actually try to load a PNG.

MS could have very easily put DLL load-on-first use into the thunk libs they make. That would have been nice. (ADDENDUM : I guess they did!)

Anyhoo, you can do it yourself, and it looks like this :

We need CALL_IMPORT from before :


template <typename t_func_type>
t_func_type GetImport( t_func_type * pFunc , const char * funcName, HMODULE hm )
{
    if ( *pFunc == 0 )
    {
        ASSERT( hm != 0 );
        t_func_type fp = (t_func_type) GetProcAddress( hm, funcName );
        // not optional :
        ASSERT_RELEASE_THROW( fp != 0 );
        *pFunc = fp;
    }
    return (*pFunc); 
}

#define CALL_IMPORT(name,hm) (*GetImport(&STRING_JOIN(fp_,name),STRINGIZE(name),hm))
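(STRING_JOIN and STRINGIZE come from cblib; if you don't have them, the usual two-level preprocessor definitions look like this - illustrative, not the cblib originals :)

```cpp
#include <cstring>

// two-level expansion so that macro arguments get expanded before they
// are stringized or pasted (the standard preprocessor idiom) :
#define STRINGIZE_2(x)      #x
#define STRINGIZE(x)        STRINGIZE_2(x)
#define STRING_JOIN_2(a,b)  a##b
#define STRING_JOIN(a,b)    STRING_JOIN_2(a,b)
```

so STRING_JOIN(fp_,png_get_io_ptr) makes the identifier fp_png_get_io_ptr, and STRINGIZE(png_get_io_ptr) makes the string "png_get_io_ptr" for GetProcAddress.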

and I'm going to just explicitly load the PNG DLL when I get to LoadPNG :

static HMODULE hm_png = 0;
#define HMODULE_FAILED  (HMODULE) 1

static bool my_png_init()
{
    if ( hm_png ) 
    {
        return ( hm_png == HMODULE_FAILED ) ? false : true;
    }
    
    HMODULE hm_zl = LoadLibrary(ZLIB_NAME);
    if ( hm_zl == 0 )
    {
        lprintf("Couldn't load Zlib (%s)\n",ZLIB_NAME);
        hm_png = HMODULE_FAILED;
        return false;
    }
    
    hm_png = LoadLibrary(PNG_NAME);
    if ( hm_png == 0 )
    {
        lprintf("Couldn't load PNG lib (%s)\n",PNG_NAME);
        hm_png = HMODULE_FAILED;
        return false;
    }
    
    lprintf_v2("Using libpng.\n");
                
    return true;
}

#define CALL_PNG(name)      CALL_IMPORT(name,hm_png)

so now we just have to replace all our png calls with CALL_PNG() calls, like :

    png_ptr = png_create_read_struct(PNG_LIBPNG_VER_STRING, NULL,NULL,NULL);

    ->

    png_ptr = CALL_PNG(png_create_read_struct)(PNG_LIBPNG_VER_STRING, NULL,NULL,NULL);

(obviously you could #define png_create_read_struct CALL_PNG(png_create_read_struct) to make it totally transparent)

Lastly we need to define all the fp_ func pointers with the right signature. This is particularly easy because png.h wraps its function externs with macros (nice, thanks guys), so we can just abuse those macros. png.h contains things like :


PNG_EXPORT(png_voidp,png_get_io_ptr) PNGARG((png_structp png_ptr));

So we can just do :

#define PNG_EXPORT(ret,func)    static ret (PNGAPI * STRING_JOIN(fp_,func))
#define PNGARG(args)            args = 0

and then the png proto is defining our fp for us. (unfortunately png.h sticks "extern" in front of all the protos, so you have to copy out the protos and take off the extern).
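(to see the expansion concretely, here's a toy self-contained version; the typedefs and the empty PNGAPI are stand-ins for the real png.h ones, and I've pasted fp_ directly instead of using STRING_JOIN :)

```cpp
// stand-ins for the real png.h types and calling convention :
#define PNGAPI
typedef void * png_voidp;
typedef struct png_struct_def * png_structp;

// the redefined macros :
#define PNG_EXPORT(ret,func)    static ret (PNGAPI * fp_##func)
#define PNGARG(args)            args = 0

// this prototype is lifted straight from png.h ; after the macros above it
// expands to : static png_voidp (* fp_png_get_io_ptr)(png_structp png_ptr) = 0;
PNG_EXPORT(png_voidp,png_get_io_ptr) PNGARG((png_structp png_ptr));
```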

So anyhoo there you go, libpng use with delay-load.

BTW this also suggests a way that you can make your DLL very easy to use with delay-load and manual imports. What you should do is provide a header with only your function protos in it, and wrap them with a macro like :


DECLFUNC( ret, name , args );

eg.

DECLFUNC( png_voidp, png_get_io_ptr, (png_structp png_ptr) );

Then a client could just include that header multiple times and change DECLFUNC to various things. For example if you had a header like that, you can look up all the func names at LoadLibrary time, instead of doing each one on first use (this removes a branch from each function call site). eg :

#define DECLFUNC( ret, name , args )    ret (CALLBACK * STRING_JOIN(fp_,name)) args = 0
#include "allfuncs.inc"
#undef DECLFUNC

void ImportAllFuncs(const char * dllName)
{
    HMODULE hm = LoadLibrary(dllName);
#define DECLFUNC( ret, name , args )    GetImport( &STRING_JOIN(fp_,name), STRINGIZE(name), hm );
#include "allfuncs.inc"
#undef DECLFUNC
}

Unfortunately you can't do a #define inside a #define or this could be used to alias the names transparently with something like


#define DECLFUNC(ret,name,args) #define name (* STRING_JOIN(fp_,name) )
#include "allfuncs.inc"
#undef DECLFUNC

(ADDENDUM : or, you know, just use the /DELAYLOAD option)
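(for reference, the linker route is just a build flag plus the delay-load helper lib; something like this, with the DLL name being a placeholder for whatever your PNG DLL is called :)

```
cl loadpng.c /link /DELAYLOAD:your_png.dll delayimp.lib
```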

07-15-12 - Internet Toggle

For the most part I'm not that prone to the time-wasting allure of the internet when I'm supposed to be working. But recently I've been writing the docs for Oodle and it's just excruciating boring work. Once in a while I have something important to say in the docs, but the vast majority is stuff like :

DOCGEN int Oodle_ComputeAPlusB(int A, int B);
/* Computes A + B

    $:A   the value of A in A+B
    $:B   the value of B in A+B
    $:return    the sum of A and B

  Oh god please kill me now.

*/

Every time I wrote one of those docs I would involuntarily pop up my web browser and see if Kids on Crack had any updates. So I started turning off my internet access to break that habit.

I was using ipconfig /release and /renew to do it, but that has a few problems : 1. it's very slow (not a big deal since the toggle is rare), and 2. it also kills my VPN to work.

Jeff suggested a better way is to turn DNS off and on. I found netsh can do that pretty easily. I'd never used netsh before, it's pretty nice. The commands are :


dns_on.bat :

netsh interface ip set dns name="Wireless Network Connection" dhcp

dns_off.bat :

netsh interface ip set dns name="Wireless Network Connection" static 0.0.0.0
ipconfig /flushdns

(or, you know, slightly different as is appropriate for your net config).
(note the flushdns also, otherwise you'll have things like Google in cache).

For machines you want to still access in "internet off" mode, you can of course just use their explicit IP, or slightly nicer you can add them to your "hosts" file.
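(a hosts entry is just name-to-IP; the IP and name here are made up :)

```
# in %SystemRoot%\System32\drivers\etc\hosts :
192.168.1.50    workserver
```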

(aside : another cool feature of netsh is that you can save & restore your entire network config if you want to temporarily mess things up; just use "netsh dump" to save it and "netsh exec" to restore it).

07-15-12 - cbvector

I tried to use std::vector in one of the Oodle examples, and the freaking MSVC STL generates link dependencies by default. WTF. So I got annoyed and wrote my own single file standalone vector.

Caveats : this is ripped out of cblib (with dependencies removed), so it's rather ugly. This is not totally STL compatible. It works well enough for my purposes. Feel free to improve it.

Download : cbvector.h at cbloom.com

cbvector.h intentionally doesn't include anything. You may need to include stddef.h and/or new.h before cbvector.h.

WARNING : cbvector.h uses #defines to configure various aspects of the implementation. If you set those #defines to different things in different code files, then you must ensure that one of the following is true : 1. the CB_VECTOR class name is different in each case, or 2. the linkage of all funcs affected is "static", or 3. the entire cbvector is in an anonymous namespace (which is the default). If you're not careful about this you can get super bizarro bugs when the linker merges functions that aren't actually the same.

6/21/2012

06-21-12 - Two Alternative Oodles

I want to write about two interesting ideas for async IO systems that I am *not* doing in Oodle.

1. The graph-forwarding automated parallelism system.

The idea here is that you make all your async operations be like flow chart widgets, they have "in" and "out" channels, and then you can draw links and hook up ins to outs.

This creates dependencies automatically, each op depends on everything that feeds its input channels.

So for example you might do :


"c:\junk"  :-> OpenFile :-> fh

(that is, OpenFile takes some string as input and then puts out a file handle named "fh")

fh :->   ReadFile  :->   buf[0-32768]
buf :->
0-32768  :->

(that is, make a ReadFile op that takes the output of the Openfile, and outputs a valid buffer range)

buf[0-32768]  :->  SPU_LZCompress :-> lzSize , compbuf[0-lzSize]
compbuf

(do an LZ compress on the valid buf range and output to compressed buf)

lzSize , compbuf[0-lzSize] :-> WriteFile

etc..

This sets up a chain of operations with dependencies, you can fire it off and then wait on it all to complete. So, in a sense it's nice because you don't have to write code that waits on the completion of each op and fires the next op and so on.

There are various ways you could set up the async chains, obviously GUI tools where you drag lines around are popular for this sort of thing, but I think that's a terrible way to go. You could just have a text markup, or some creation functions that you call to build up the graph.

Another interesting option is to use the "just run the code" method. That is, you make proxy classes for all the variable types and do-nothing functions with the names of the ops (ReadFile, etc.); then you just run this fake imperative code, and all it does is record the calls and the arguments, and uses that to build the graph. That's easy enough to do for code without branches, but that's sort of a trivial case and I'm not sure how to make it work with branches. In fact in general this type of thing sucks bad for code with loops or branches.
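(a minimal sketch of the record-the-calls idea, with all names and types invented for illustration : the proxy ops do no IO at all, they just append nodes to a graph :)

```cpp
#include <string>
#include <vector>

// each recorded op is a node ; inputs are recorded by node index :
struct GraphNode { std::string op; std::vector<int> inputs; };
static std::vector<GraphNode> g_graph;

// proxy "value" : just remembers which node produced it :
struct Proxy { int node; };

static Proxy Record(const std::string & op, std::vector<int> inputs)
{
    g_graph.push_back( GraphNode{ op, inputs } );
    return Proxy{ (int) g_graph.size() - 1 };
}

// do-nothing stand-ins for the real ops :
Proxy OpenFile(const std::string & name) { return Record("OpenFile:" + name, {}); }
Proxy ReadFile(Proxy fh)                 { return Record("ReadFile", { fh.node }); }
Proxy LZCompress(Proxy buf)              { return Record("LZCompress", { buf.node }); }
```

running LZCompress(ReadFile(OpenFile("c:\\junk"))) executes nothing, but leaves behind a three-node dependency graph that a scheduler could then walk.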

Anyway, I think that this method is basically a terrible idea, except for one thing : creating the graph of the entire async operation before doing it can be a huge performance win. It allows you to "see the future" in terms of what the client wants to do, and thus make better scheduling decisions to maximize utilization of your available computation resources and disk at all times.

In the simplest case, if the client calls a huge Read and then a huge LZCompress after it, that's a dumb non-parallel way to do things, but in the normal imperative Oodle I can't do anything about it, because at the time I get the Read I don't know what's coming after it. If you gave me the graph, I could go, oh hey during the Read I'm not using the CPU at all, and during the big LZCompress I'm not using the disk, so let me break those into smaller pieces and overlap them. Obviously you can schedule IO's around the timeline so that they try to be issued early enough so their results are back when needed. (though even with the full async graph you can't really schedule right unless you know how long the cpu operations are going to take).

There are even more subtle low level ways that not knowing the future gets me. In the whole worker thread system, there are crucial decisions like "should I wake up a new worker thread to take this work item, or wait for one of the existing worker threads to take it?" or "should I signal the worker threads as I make each item, or wait until I make them all?" ; or even just deciding which worker thread should take a task to maximize cache coherence. You cannot possibly get these decisions right without knowing the future.

Anyhoo, I don't think the advantage outweighs the horribleness of writing code this way, so on to the next one :

2. The coroutine auto-yielding system.

What if your file IO was always done from a coroutine, and instead of blocking when it didn't have the bytes needed, it just yielded the coroutine? That would give you fully async file IO (in the sense that you never block a thread just waiting on IO), and you could write code just like plain sync IO.

eg. you would just write :


// in a coroutine :

FILE * fp = fopen("c:\\junk","rb");  // yields the coroutine for the async open

int c = fgetc(fp); // might yield for a buffer fill

etc..

This is sort of appealing, it certainly makes it easy to write async IO code. I personally don't really love the fact that the thread yielding is totally hidden; I like major functional operations to be clear in the imperative code.

The real practical problem is that you just can't make this nice in the hacky switch-based C coroutine method that I use. You really need a language that supports coroutines natively.

You can cook up a clunky way of doing this with the C coroutines, something like :


#define cofread(fp,buf,size) \
    for(;;) \
    { \
        { \
        OodleAsyncHandle h = Oodle_ReadOrReturnAsyncHandle(fp,buf,size); \
        /* Oodle_ReadOrReturnAsyncHandle returns 0 if the Read could complete without waiting */ \
        if ( ! h ) \
            break; \
        \
        Coroutine_AddDependency(h); \
        } \
        COROUTINE_YIELD(); \
        \
        Coroutine_FlushDoneDependencies(); \
    }


where COROUTINE_YIELD is my macro that does something like :
  self->state = 7;
  return;
  case 7:

So now you can call cofread() from an Oodle coroutine and it kind of does what we want.

But, because of the fact that we use switches and returns, we can't use any stack variables in the arguments to cofread ; eg :

{
    int size = 16384;

    cofread(fp,buf,size);
}
is no good: if you resume at the YIELD point inside cofread, "size" is gone. (you'd get a "case statement skips variable initialization" error or something like that).

Basically with the hacky C coroutine method you just can't do funny business like this where you hide the control flow; you have to make the yield points very explicit because they are points where you lose all your stack variables and must recreate them.

Perhaps a larger issue is that if you really were going to go with the full coroutine auto-yielding system, you'd want to be able to yield from inside function calls, not just at the root level of the coroutine. eg. you'd like to call functions that might do file IO or fire off worker tasks, and you want them to be able to yield too. That's not possible unless you have full stack-saving coroutines.

ADDENDUM for clarity :

It's totally trivial to fix the lack of stack saving in a limited way. All I have to do is reserve a few slots in the coroutine struct that cofread can use to store its variables. So cofread becomes :


#define cofread(fp,buf,size) \
    co->m_fp = fp; co->m_buf = buf; co->m_size = size; \
    for(;;) \
    { \
        { \
        OodleAsyncHandle h = Oodle_ReadOrReturnAsyncHandle(co->m_fp,co->m_buf,co->m_size); \
        /* Oodle_ReadOrReturnAsyncHandle returns 0 if the Read could complete without waiting */ \
        if ( ! h ) \
            break; \
        \
        Coroutine_AddDependency(h); \
        } \
        COROUTINE_YIELD(); \
        \
        Coroutine_FlushDoneDependencies(); \
    }

and now you really can use cofread within a coroutine, and you can use local variables as arguments to it, and it yields if it can't complete immediately, and that's all nice.

But it's *not* what I really want for this proposal, which is a full transparent system that a client can build their IO on. The problem is that cofread can only be called at the "root" level of a coroutine. That is, because the "yield" is not a true language yield that preserves function call stack, it must be in the base coroutine function.

eg. you can do :


MyCoroutine1( coroutine * co )
{

COROUTINE_START()

   g_fp = cofopen("blah");  

    cofread(g_fp,g_buf,1024);

COROUTINE_DONE()

}

That's easy. But you cannot do :

void MyHelper()
{
   g_fp = cofopen("blah");  

    cofread(g_fp,g_buf,1024);
}


MyCoroutine2( coroutine * co )
{

COROUTINE_START()

    MyHelper();

COROUTINE_DONE()

}

and that lack of composability makes it unusable as a general purpose way to do IO.

To be super clear and redundant again - Oodle of course does support and extensively uses coroutine IO, but it is for small tasks that I want to have maximum performance (like, eg. read and decompress a Package), where the limitation of having to write all your yielding code within one function is okay. The idea of proposal #2 is to make a system that is visible to the client and totally transparent, which they could use to write all their game IO.

(ASIDE : there is a way to do this in C++ in theory (but not in practice). What you do is do all your yielding at the coroutine root level still, either using the switch method or the lambda method (doesn't really matter). To do yields inside function calls, what you do is have your IO routines throw a DataNotReady exception. Then at the coroutine root level you catch that exception and yield. When you resume from the yield, you retry the function call and should make it further this time (but might throw again). To do this, all your functions must be fully rewindable, that is they should be exception safe, and should use classes that back out any uncommitted changes on destruction. I believe this makes the idea technically possible, but unusable in reality).

6/19/2012

06-19-12 - Two Learnings

Two things that I got wrong in the past and have only recently come around to what I believe is the right way.

1. Use pointer-sized ints for memory buffer sizes.

When I made the transition to building cross-platform for both 32 and 64 bit, I was really annoyed with the fact that I couldn't just use "int" everywhere any more. To make it easier for myself, I mostly just used int64 for memory buffers, and made a bunch of helpers for myself that returned int instead of size_t and intptr_t , eg. for things like vector.size() and strlen() and so on I used int-returning variants.

The nice thing about using int64 for memory buffer sizes is that your type doesn't change when you build different targets; that makes coding a lot simpler (and removes the possibility of bugs due to struct sizes changing and such).

Only very recently have I come around, and it's probably not for the reason you think. It's because using a separate type for "pointer sized thing" is a good way of documenting functions with types.

In the previous API post I talked about how I like the function arg types to be as self-documenting as possible. It's even better if you can make it a compile error or warning when you mis-use it. So I was looking at API's like :


Oodle_Read( OodleFile * f, void * buffer, S64 size, U64 filePos );

in particular at the call site where you have something like :

S64 i1,i2;

Oodle_Read(f,ptr,i1,i2);

you can't tell what the last two args are just by looking at the call site, and what's worse, you can switch them in the call order and get no warning. That sucks. It's much better to do :

Oodle_Read( OodleFile * f, void * buffer, SINTa size, U64 filePos );

(SINTa is the RAD pointer-sized signed int). It's now clearly documented in the variable types - this is the size of a memory buffer.

If you have an old code base, it's very annoying at first to do this, because you'll have a lot of conversions to do. It only becomes clean once you have changed your whole code base to follow the rules :


SINTa/UINTa - all memory buffer sizes and array sizes

S64/U64 - file positions and sizes

S32/U32 - really not used for much, just enums and modes and flags, not array sizes

Not only does this put more documentation on the variable, it also makes it more clear when you are doing dangerous cast-downs.

eg. when you try to read a whole file into a memory buffer. You get the file size as an S64 and then you need to pass something to malloc, which takes an SINTa. That makes a clear single point where you are doing a questionable cast. Once you've done that one cast, the rest of the code is cast-free which is nice.

Furthermore, I think it's augmented by having helpers for the cast-downs :


U32 oo64to32(U64 x);  // used super rarely, very weird thing to do
U32 ooAto32(UINTa x); // used super rarely, very weird thing to do

UINTa oo64toA(U64 x); // used for file sizes -> memory buffers, somewhat common

There are only three cast-downs needed, and really the first two should almost never be used; if you are using them a lot it's a sign of bad code. The last one is common, and it should do a check to ensure the cast is okay (eg. if using a > 2GB file size on a 32 bit system).
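(a sketch of the checked cast-down, with stand-in typedefs for the RAD types; what you do on failure - assert, log, throw - is up to you :)

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t  U64;
typedef uintptr_t UINTa;

// file size -> memory size ; must verify the value survives the narrowing
// (on a 32-bit target a > 4GB U64 would silently truncate otherwise) :
UINTa oo64toA(U64 x)
{
    UINTa a = (UINTa) x;
    assert( (U64) a == x );   // failing silently here is exactly the bug
    return a;
}
```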

2. Avoid recursive mutexes.

Like many people, I read the wisdom of the ancients (old school threading programmers) who said that "recursive mutexes are of questionable value and lead to dangerous code; prefer non-recursive mutexes" ; I read that and I went "psshaw" and thought they were crotchety dinosaurs. Well, I've come around.

The thing about recursive mutexes is that, like much code which is attractive to the novice, they make the trivial case simpler and look cleaner, but they make the hard case much worse, and the hard case is what actually matters.

The trivial case is : object only locks itself, never locks other objects, never can be freed while locked, etc.

But inevitably in real world threaded code you have to deal with the harder case for mutexes, which is : object might have to lock other objects (and they might have to lock it); lock order may be hard to establish; object may want to free itself while locked, etc.

The way that the novice normally makes a thread-safe object with a recursive mutex is to put a lock scoper in every function on that object :


Object::Func(...)
{
  m_mutex.Lock();

  do stuff;

  m_mutex.Unlock();
}

or really :

Object::Func(...)
{
  MUTEX_SCOPER(m_mutex);

  do stuff;
}

and crucially, when you call other member functions on yourself, that takes the mutex again recursively. (we're going to ignore the efficiency cost of taking the mutex many times). At first that way seems nice, it's easy, but then you encounter one of the real world problems and it falls apart.

What's better is to separate the mutex-taking from the actions, so instead you do :


Object::Func_Locked(...)
{
  do stuff;
}

Object::Func_Unlocked(...)
{
  MUTEX_SCOPER(m_mutex);

  Func_Locked();
}

Then all the functionality is in the _Locked functions and they can call each other.

So now you need to do something like call A->Func1 and B->Func2 , and it has to be done "atomically" , eg. with both locked. Then you run into the issue of mutex order of acquisition and possible deadlocks. If you have used the first style where objects lock themselves, then it is impossible for objects to call each other safely. That is, object A can never call object B, because object B might call object A and there's a deadlock. But with the _Locked functions, any _Locked function can call to another _Locked function, and they don't have to worry about deadlock; instead the lock taking is all pushed up and can be done in a standardized order (or even atomically if you have multi-mutex lock acquisition).
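(for example, with the _Locked split, a two-object atomic op can use standard deadlock-free acquisition; this is a sketch using std::mutex, not any particular Oodle primitive :)

```cpp
#include <mutex>

struct Object {
    std::mutex m_mutex;
    int m_x = 0;

    void Inc_Locked() { m_x++; }        // caller must hold m_mutex

    void Inc_Unlocked() {               // public entry : takes the lock itself
        std::lock_guard<std::mutex> scoper(m_mutex);
        Inc_Locked();
    }
};

// atomically increment both objects ; std::lock acquires both
// mutexes without deadlock regardless of the order callers pass them :
void IncBoth(Object & a, Object & b)
{
    std::lock(a.m_mutex, b.m_mutex);
    std::lock_guard<std::mutex> la(a.m_mutex, std::adopt_lock);
    std::lock_guard<std::mutex> lb(b.m_mutex, std::adopt_lock);
    a.Inc_Locked();
    b.Inc_Locked();
}
```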

The other thing that non-recursive locks clean up is that you know, whenever you see a call to Unlock(), that's the *last* call to unlock and it definitely makes the object publicly visible. That is, there is a crucial transition point from "I own this object" to "others can have it", and with the recursive lock method that transition point is muddied.

For example, consider this simple and common case : something in the object's internal state tells you it should be deleted. You must do that atomically, because if you unlock then delete, the internal state might change, eg. :


Object::Func()
{
  bool deleteMe = false;

  m_mutex.lock();

  if ( m_x > m_y )
     deleteMe = true;

  m_mutex.unlock();

  // *!*

  if ( deleteMe )
    delete this;
}

That code is no good: at the *!* point, some other thread might touch the object and make the check m_x > m_y no longer true, and then we would be deleting this object incorrectly, when the invariant is not set. In order to make this work you need a combined "unlock_and_delete" operation. But to do that you need to know that your unlock is the *last* unlock - you can't know that in the recursive mutex style.

Basically the recursive mutex is sort of like optimizing the code that's already fast; it makes the simple case a bit simpler, but it makes the real world hard cases much worse. Better to just start from the beginning with the more robust long term solution.

Once you realize that the advantage of a recursive mutex is an illusion, then the advantages of the non- recursive mutex become appealing. You can implement a non-recursive mutex far more efficiently. It can take only 1 bit of memory. Every object can easily have its own non-recursive mutex and they don't even need to be allocated or initialized at all.
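(eg. the whole non-recursive mutex can be one flag; a minimal spin-lock sketch, not anything production-worthy :)

```cpp
#include <atomic>

// a non-recursive mutex really only needs one bit of state :
struct TinyMutex {
    std::atomic_flag f = ATOMIC_FLAG_INIT;

    void lock()   { while ( f.test_and_set(std::memory_order_acquire) ) { } }
    void unlock() { f.clear(std::memory_order_release); }
};
```

(no owner field, no recursion count, no allocation or initialization beyond zeroing; a recursive mutex needs all of those.)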
