5/02/2013

05-02-13 - Simple C++0x style LF structs and adapter for x86-Windows

I've seen a lot of lockfree libraries out there that are total crap. Really weird non-standard ways of doing things, or overly huge and complex.

I thought I'd make a super simple one in the correct modern style. Download : cbliblf.zip

(If you want a big fully functional much-more-complete library, Intel TBB is the best I've seen. The problem with TBB is that it's huge and entangled, and the license is not clearly free for all use).

There are two pieces here :

"cblibCpp0x.h" provides atomic and such in C++0x style for MSVC/Windows/x86 compilers that don't have real C++0x yet. I have made zero attempt to make this header syntatically identical to C++0x, there are various intentional and unintentional differences.

"cblibLF.h" provides some simple lockfree utilities (mostly queues) built on C++0x atomics.

"cblibCpp0x.h" is kind of by design not at all portable. "cblibLF.h" should be portable to any C++0x platform.

WARNING : this stuff is not super well tested because it's not what I use in Oodle. I've mostly copy-pasted this from my Relacy test code, so it should be pretty strong but there may have been some copy-paste errors.

ADDENDUM : In case it's not clear, you do not *want* to use "cblibCpp0x.h". You want to use real Cpp0x atomics provided by your compiler. This is a temporary band-aid so that people like me who use old compilers can get a cpp0x stand-in, so that they can do work using the modern syntax. If you're on a gcc platform that has the __atomic extensions but not C1X, use that.

You should be able to take any of the C++0x-style lockfree code I've posted over the years and use it with "cblibCpp0x.h" , perhaps with some minor syntactic fixes. eg. you could take the fastsemaphore wrapper and put the "semaphore" from "cblibCpp0x.h" in there as the base semaphore.

Here's an example of what the objects in "cblibLF.h" look like :


//=================================================================         
// spsc fifo
//  lock free for single producer, single consumer
//  requires an allocator
//  and a dummy node so the fifo is never empty
template <typename t_data>
struct lf_spsc_fifo_t
{
public:

    lf_spsc_fifo_t()
    {
        // initialize with one dummy node :
        node * dummy = new node;
        m_head = dummy;
        m_tail = dummy;
    }

    ~lf_spsc_fifo_t()
    {
        // should be one node left :
        LF_OS_ASSERT( m_head == m_tail );
        delete m_head;
    }

    void push(const t_data & data)
    {
        node * n = new node(data);
        // n->next == NULL from constructor
        m_head->next.store(n, memory_order_release); 
        m_head = n;
    }

    // returns true if a node was popped
    //  fills *pdata only if the return value is true
    bool pop(t_data * pdata)
    {
        // we're going to take the data from m_tail->next
        //  and free m_tail
        node* t = m_tail;
        node* n = t->next.load(memory_order_acquire);
        if ( n == NULL )
            return false;
        *pdata = n->data; // could be a swap
        m_tail = n;
        delete t;
        return true;
    }

private:

    struct node
    {
        atomic<node *>      next;
        nonatomic<t_data>   data;
        
        node() : next(NULL) { }
        node(const t_data & d) : next(NULL), data(d) { }
    };

    // head and tail are owned by separate threads,
    //  make sure there's no false sharing :
    nonatomic<node *>   m_head;
    char                m_pad[LF_OS_CACHE_LINE_SIZE];
    nonatomic<node *>   m_tail;
};

Download : cbliblf.zip

4/30/2013

04-30-13 - Packing Values in Bits - Flat Codes

One of the very simplest forms of packing values in bits is simply to store a value with non-power-of-2 range and all values of equal probability.

You have a value that's in [0,N). Ideally all code lengths would be the same ( log2(N) ) which is fractional for N not a power of 2. With just bit output, we can't write fractional bits, so we will lose some efficiency. But how much exactly?

You can of course trivially write a symbol in [0,N) by using log2ceil(N) bits. That's just going up to the next integer bit count. But you're wasting values in there, so you can take each wasted value and use it to reduce the length of a code that you need. eg. for N = 5 , start with log2ceil(N) bits :

0 : 000
1 : 001
2 : 010
3 : 011
4 : 100
x : 101
x : 110
x : 111
The first five codes are used for our values, and the last three are wasted. Rearrange to interleave the wasted codewords :
0 : 000
x : 001
1 : 010
x : 011
2 : 100
x : 101
3 : 110
4 : 111
now since we have adjacent codes where one is used and one is not used, we can reduce the length of those codes and still have a prefix code. That is, if we see the two bits "00" we know that it must always be a value of 0, because "001" is wasted. So simply don't send the third bit in that case :
0 : 00
1 : 01
2 : 10
3 : 110
4 : 111

(this is a general way of constructing shorter prefix codes when you have wasted values). You can see that the number of wasted values we had at the top is the number of codes that can be shortened by one bit.

A flat code is written thusly :


void OutputFlat(int sym, int N)
{
    ASSERT( N >= 2 && sym >= 0 && sym < N );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;
    // T is the number of "wasted values"
    if ( sym < T )
    {
        // write in B-1 bits
        PutBits(sym, B-1);
    }
    else
    {
        // write in B bits
        // push value up by T
        PutBits(sym+T, B);
    }
}

int InputFlat(int sym,int N)
{
    ASSERT( N >= 2 && sym >= 0 && sym < N );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;

    int sym = GetBits(B-1);
    if ( sym < T )
    {
        return sym;
    }
    else
    {
        // need one more bit :
        int ret = (sym<<1) - T + GetBits(1);        
        return ret;
    }
}

That is, we write (T) values in (B-1) bits, and (N-T) in (B) bits. The intlog2ceil can be slow, so in practice you would want to precompute that or pass it in as a parameter.

So, what is the loss vs. ideal, and where does it occur? Let's work it out :


H = log2(N)  is the ideal (fractional) entropy

N is in (2^(B-1),2^B]
so H is in (B-1,B]

The number of bits written by the flat code is :

L = ( T * (B-1) + (N-T) * B ) / N

with T = 2^B - N

Let's set

N = f * 2^B

with f in (0.5,1] our fractional position in the range.

so T = 2^B * (1 - f)

At f = 0.5 and 1.0 there's no loss, so there must be a maximum in that interval.

Doing some simplifying :

L = (T * (B-1) + (N-T) * B)/N
L = (T * B - T + N*B - T * B)/N
L = ( N*B - T)/N = B - T/N
T/N = (1-f)/f = (1/f) - 1
L = B - (1/f) + 1

The excess bits is :

E = L - H

H = log2(N) = log2( f * 2^B ) = B + log2(f)

E = (B - (1/f) + 1) - (B + log2(f))
E = 1 - (1/f) - log2(f)

so find the maximum of E by taking a derivative :

d/df(E) = 0
d/df(E) = 1/f^2 - (1/f)/ln2
1/f^2 = (1/f)/ln2
1/f = 1/ln(2)
f = ln(2)
f = 0.6931472...

and at that spot the excess is :

E = 1 - (1/ln2) - ln(ln2)/ln2
E = 0.08607133...

The worst case is 8.6% of a bit per symbol excess. The worst case appears periodically, once for each power of two.

The actual excess bits output for some low N's :

The worst case actually occurs as N->large, because at higher N you can get f closer to that worst case fraction (ln(2)). At lower N, the integer steps mean you miss the worst case and so waste less. This is perhaps a bit surprising, you might think that the worst case would be at something like N = 3.

In fact for N = 3 :


H = l2(3) = 1.584962 ...

L = average length written by OutputFlat

L = (1+2+2)/3 = 1.66666...

E = L - H = 0.08170421...

(obviously if you measure the loss as a percentage of the output length, the worst case is at N=3, and there it's 5.155% of the entropy).

4/21/2013

04-21-13 - How to grow Linux disk under VMWare

There's a lot of these guides around the net, but I found them all a bit confusing to follow, so my experience :
  • 1. Power off the VM.

  • 2. Make a backup of the whole VM in case something goes wrong. Just find the dir containing it and copy the whole thing.

  • 3. VMWare Settings -> Hardware -> Hard Disk -> Utilities -> Expand change to whatever size you want.

  • 4. This has just expanded the virtual disk, now you must grow the partition on the disk. Linux does not have good tools to grow a partition that is running the OS, so don't boot your VM back into Linux.

    (there is some LVM stuff that lets you make multiple partitions and then treat them as a single one, but for a Unix newb like myself that looks too scary).

  • 5. Download GParted ISO. VMWare Settings -> Hardware -> CD/DVD -> Use ISO -> point it at GParted.

    Also make sure "Connect at Power On" is checked.

  • 6. Now you have to get the virtual machine to boot from CD. Getting into the BIOS interactively was impossible for me. Fortunately VMWare has a solution. Find your VM files and edit the .vmx config file. Add this line :
    bios.forceSetupOnce = "TRUE"
    

  • 7. Power on the VM and you should enter the BIOS. Go to "Boot" and put the CD first. Save and Exit and you should now boot into GParted.

  • 8. Using GParted is pretty self explanatory. It's a good tool. When you're done, shut down and Power Off the VM.

    My VM was set up with a swap partitition, so I had to move that to the end before I could grow the primary partition. I hear that you can set up Linux with a swap file instead of a swap partition; that would be preferable. A swap partition makes zero sense in a VM where the disks are virtualized anyway (so the advantage of keeping the swap thrashing off your main disk doesn't exist). Not something I want to change though.

  • 9. VMWare Settings -> Hardware -> CD/DVD -> turn off "Connect at Power On"
It seems to me it's a good idea to leave it this way - BIOS is set to boot first from CD, but the VM is set with no CD hardware enabled. This makes it easy to change the ISO and just check that box any time you want to boot from an ISO, rather than having to go into that BIOS nightmare again.


More generally, what have I learned about multi-platform development from working at RAD ?

That it's horrible, really horrible, and I pray that I never have to do it again in my life. Ugh.

Just writing cross-platform code is not the issue (though that's horrible enough, solely due to stupid non-fundamental issues like the fact that struct packing isn't standardized, adding signed ints isn't standardized, restrict/noalias isn't standardized, inline linkage varies greatly, etc. urg wtf etc etc). If you're just releasing some code on the net and offering it for many platforms (leaving it up to the downloaders to actually build it and test it), your life is easy. The horrible part is if you actually have to maintain machines and build systems for all those platforms, test them, be able to debug on them, keep all the sdk's up to date, etc. etc.

(in general coding is easy when you don't actually test your code and make sure it works well, which a surprising number of people think is "done"; hey it compiles, I'm done! umm, no...)

(I guess that's a more general life thing; I observe a lot of people who just do things and don't actually measure whether the "doing" was successful or done well, but they just move on and are generally happy. People who stress over whether what they're doing is actually a good job or not are massively less happy but also actually do good work.)

I feel like I spend 90% of my time on stupid fucking non-algorithmic issues like this Linux partition resizing shit (probably more like 20%, but that's still frustratingly high). The regression tests are failing on Linux, okay have to figure out why, oh it's because the VM disk is too small, okay how do I fix that; or the PS4 compiler has a bug I have to work around, or the system software on this system has a bug, or the Mac clang wants to spew pointless warnings about anonymous namespaces, or my tests aren't working on Xenon .. spend some time digging .. oh the console is just turned off, or the IP changed or it got reflashed and my SDK doesn't work anymore, and blah blah fucking blah. God dammit I just want to be able to write algorithms. I miss coding, I miss thinking about hard problems. Le sigh.

I've written before about how in my imagination I could hire some kid for $50k to do all this shit work for me and it would be a huge win for me overall. But I'm afraid it's not that easy in reality.

What really should exist is a "coder cloud" service. There should be a bunch of VMs of different OS'es with various compilers and SDKs installed, so I can just say "build my shit for X with Y". Of course you need to be able to run tests on that system as well, and if something goes wrong you need remote desktop for interactive debugging. It's got to have every platform, including things like game consoles where you need license agreements, which is probably a no-go in reality because corporations are jerks. There's got to be superb customer service, because if I can't rely on it for builds at every moment of every day then it's a no-go. Unfortunately, programmers are almost uniformly overly optimistic about this kind of thing (in that they massively overestimate their own ability to manage these things quickly) so wouldn't want to pay what it costs to run that service.

4/10/2013

04-10-13 - Waitset Resignal Notes

I've been redoing my low level waitset and want to make some notes. Some previous discussion of the same issues here :

cbloom rants 11-28-11 - Some lock-free rambling
cbloom rants 11-30-11 - Some more Waitset notes
cbloom rants 12-08-11 - Some Semaphores

In particular, two big things occurred to me :

1. I talked before about the "passing on the signal" issue. See the above posts for more in depth details, but in brief the issue is if you are trying to do NotifyOne (instead of NotifyAll), and you have a double-check waitset like this :


{
waiter = waitset.prepare_wait(condition);

if ( double check )
{
    waiter.cancel();
}
else
{
    waiter.wait();
    // possibly loop and re-check condition
}

}

then if you get a signal between prepare_wait and cancel, you didn't need that signal, so a wakeup of another thread that did need that signal can be missed.

Now, I talked about this before as an "ugly hack", but over time thinking about it, it doesn't seem so bad. In particular, if you put the resignal inside the cancel() , so that the client code looks just like the above, it doesn't need to know about the fact that the resignal mechanism is happening at all.

So, the new concept is that cancel atomically removes the waiter from the waitset and sees if it got a signal that it didn't consume. If so, it just passes on that signal. The fact that this is okay and not a hack came to me when I thought about under what conditions this actually happens. If you recall from the earlier posts, the need for resignal comes from situations like :


T0 posts sem , and signals no one
T1 posts sem , and signals T3
T2 tries to dec count and sees none, goes into wait()
T3 tries to dec count and gets one, goes into cancel(), but also got the signal - must resignal T2

the thing is this can only happen if all the threads are awake and racing against each other (it requires a very specific interleaving); that is, the T3 in particular that decs count and does the resignal had to be awake anyway (because its first check saw no count, but its double check did dec count, so it must have raced with the sem post). It's not like you wake up a thread you shouldn't have and then pass it on. The thread wakeup scheme is just changed from :

T0 sem.post --wakes--> T2 sem.wait
T1 sem.post --wakes--> T3 sem.wait

to :

T0 sem.post
T1 sem.post --wakes--> T3 sem.wait --wakes--> T2 sem.wait

that is, one of the consumer threads wakes its peer. This is a tiny performance loss, but it's a pretty rare race, so really not a bad thing.

The whole "double check" pathway in waitset only happens in a race case. It occurs when one thread sets the condition you want right at the same time that you check it, so your first check fails and after you prepare_wait, your second check passes. The resignal only occurs if you are in that race path, and also the setting thread sent you a signal between your prepare_wait and cancel, *and* there's another thread waiting on that same signal that should have gotten it. Basically this case is quite rare, we don't care too much about it being fast or elegant (as long as it's not disastrously slow), we just need behavior to be correct when it does happen - and the "pass on the signal" mechanism gives you that.

The advantage of being able to do just a NotifyOne instead of a NotifyAll is so huge that it's worth adopting this as standard practice in waitset.

2. It then occurred to me that the waitset PrepareWait and Cancel could be made lock-free pretty trivially.

Conceptually, they are made lock free by turning them into messages. "Notify" is now the receiver of messages and the scheme is now :


{
waiter w;
waitset : send message { prepare_wait , &w, condition };

if ( double check )
{
    waitset : send message { cancel , &w };
    return;
}

w.wait();
}

-------

{
waitset Notify(condition) :
first consume all messages
do prepare_wait and cancel actions
then do the normal notify
eg. see if there are any waiters that want to know about "condition"
}

The result is that the entire wait-side operation is lock free. The notify-side still uses a lock to ensure the consistency of the wait list.

This greatly reduces contention in the most common usage patterns :


Mutex :

only the mutex owner does Notify
 - so contention of the waitset lock is non-existant
many threads may try to lock a mutex
 - they do not have any waitset-lock contention

Semaphore :

the common case of one producer and many consumers (lots of threads do wait() )
 - zero contention of the waitset lock

the less common case of many producers and few consumers is slow

Another way to look at it is instead of doing little bits of waitlist maintenance in three different places (prepare_wait,notify,cancel) which each have to take a lock on the list, all the maintenance is moved to one spot.

Now there are some subtleties.

If you used a fresh "waiter" every time, things would be simple. But for efficiency you don't want to do that. In fact I use one unique waiter per thread. There's only one OS waitable handle needed per thread and you can use that to implement every threading primitive. But now you have to be able to recycle the waiter. Note that you don't have to worry about other threads using your waiter; the waiter is per-thread so you just have to worry about when you come around and use it again yourself.

If you didn't try to do the lock-free wait-side, recycling would be easy. But with the lock-free wait side there are some issues.

First is that when you do a prepare-then-cancel , your cancel might not actually be done for a long time (it was just a request). So if you come back around on the same thread and call prepare() again, prepare has to check if that earlier cancel has been processed or not. If it has not, then you just have to force the Notify-side list maintenance to be done immediately.

The second related issue is that the lock-free wait-side can give you spurious signals to your waiter. Normally prepare_wait could clear the OS handle, so that when you wait on it you know that you got the signal you wanted. But because prepare_wait is just a message and doesn't take the lock on the waitlist, you might actually still be in the waitlist from the previous time you used your waiter. Thus you can get a signal that you didn't want. There are a few solutions to this; one is to allow spurious signals (I don't love that); another is to detect that the signal is spurious and wait again (I do this). Another is to always just grab the waitlist lock (and do nothing) in either cancel or prepare_wait.


Ok, so we now have a clean waitset that can do NotifyOne and gaurantee no spurious signals. Let's use it.

You may recall we've looked at a simple waitset-based mutex before :


U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,1) != 0 )
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel();
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 )
    {
        waitset.NotifyAll( &thinlock );
    }
}
This mutex is non-recursive, and of course you should spin doing some TryLocks before going into the wait loop for efficiency.

This was an okay way to build a mutex on waitset when all you had was NotifyAll. It only does the notify if there are waiters, but the big problem with it is if you have multiple waiters, it wakes them all and then they all run in to try to grab the mutex, and all but one fail and go back to sleep. This is a common type of unnecessary-wakeup thread-thrashing pattern that sucks really bad.

(any time you write threading code where the wakeup means "hey wakeup and see if you can grab an atomic" (as opposed to "wakeup you got it"), you should be concerned (particularly when the wake is a broadcast))

Now that we have NotifyOne we can fix that mutex :


U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,2) != 0 ) // (*1)
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel(waitset_resignal_no); // (*2)
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 ) // (*3)
    {
        waitset.NotifyOne( &thinlock );
    }
}
We changed the NotifyAll to NotifyOne , but two funny bits are worth noting : (*1) - we must now immediately exchange in the waiter-flag here; in the NotifyAll case it worked to put a 1 in there for funny reasons (see cbloom rants 07-15-11 - Review of many Mutex implementations , where this type of mutex is discussed as "Three-state mutex using Event" ), but it doesn't work with the NotifyOne. (*2) - with a mutex you do not need to pass on the signal when you stole it and cancelled. The reason is just that there can't possibly be any more mutex for another thread to consume. A mutex is a lot like a semaphore with a maximum count of 1 (actually it's exactly like it for non-recursive mutexes); you only need to pass on the signal when it's possible that some other thread needs to know about it. (*3) - you might think the check for == 2 here is dumb because we always put in a 2, but there's code you're not seeing. TryLock should still put in a 1, so in the uncontended cases the thinlock will have a value of 1 and no Notify is done. The thinlock only goes to a 2 if there is some contention, and then the value stays at 2 until the last unlock of that contended sequence.

Okay, so that works, but it's kind of silly. With the mechanism we have now we can do a much neater mutex :


U32 thinlock; // = 0 initializes thinlock

Lock :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinlock );

    if ( Fetch_Add(&thinlock,1) == 0 )
    {
        // got the lock (no need to resignal)
        w.Cancel(waitset_resignal_no);
        return;
    }

    w.Wait();
    // woke up - I have the lock !
}

Unlock :
{
    if ( Fetch_Add(&thinlock,-1) > 1 )
    {
        // there were waiters
        waitset.NotifyOne( &thinlock );
    }
}
The mutex is just a wait-count now. (as usual you should TryLock a few times before jumping in to the PrepareWait). This mutex is more elegant; it also has a small performance advantage in that it only calls NotifyOne when it really needs to; because its gate is also a wait-count it knows if it needs to Notify or not. The previous Mutex posted will always Notify on the last unlock whether or not it needs to (eg. it will always do one Notify too many).

This last mutex is also really just a semaphore. We can see it by writing a semaphore with our waitset :


U32 thinsem; // = 0 initializes thinsem

Wait :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinsem );

    if ( Fetch_Add(&thinsem,-1) > 0 )
    {
        // got a dec on count
        w.Cancel(waitset_resignal_yes); // (*1)
        return;
    }

    w.Wait();
    // woke up - I got the sem !
}

Post :
{
    if ( Fetch_add(&thinsem,1) < 0 )
    {
        waitset.NotifyOne( &thinsem );
    }
}

which is obviously the same. The only subtle change is at (*1) - with a semaphore we must do the resignal, because there may have been several posts to the sem (contrast with mutex where there can only be one Unlock at a time; and the mutex itself serializes the unlocks).


Oh, one very subtle issue that I only discovered due to relacy :

waitset.Notify requires a #StoreLoad between the condition check and the notify call. That is, the standard pattern for any kind of "Publish" is something like :


Publish
{
    change shared variable
    if ( any waiters )
    {
        #StoreLoad
        waitset.Notify()
    }
}

Now, in most cases, such as the Sem and Mutex posted above, the Publish uses an atomic RMW op. If that is the case, then you don't need to add any more barriers - the RMW synchronizes for you. But if you do some kind of more weakly ordered primitive, then you must force a barrier there.

This is the exact same issue that I've run into before and forgot about again :

cbloom rants 07-31-11 - An example that needs seq_cst -
cbloom rants 08-09-11 - Threading Links (see threads on eventcount)
cbloom rants 06-01-12 - On C++ Atomic Fences Part 3

4/04/2013

04-04-13 - Tabdir

I made a fresh build of "tabdir" my old ( old ) utility that does a recursive dirlisting in tabbed-text format for "tabview".

Download : tabdir 320k zip

tabdir -?
usage : tabdir [opts] [dir]
options:
 -v  : view output after writing
 -p  : show progress of dir enumeration (with interactive keys)
 -w# : set # of worker threads
 -oN : output to name N [r:\tabdir.tab]

This new tabdir is built on Oodle so it has a multi-threaded dir lister for much greater speed. (*)

Also note to self : I fixed tabview so it works as a shell file association. I hit this all the time and always forget it : if something works on the command line but not as a shell association, it's probably because the shell passes you quotes around file names, so you need a little code to strip quotes from args.

Someday I'd like to write an even faster tabdir that reads the NTFS volume directory information directly, but chances are that will never happen.

One odd thing I've spotted with this tabdir is that the Windows SxS Assembly dirs take a ton of time to enumerate on my machine. I dunno if they're compressed or WTF the deal is with them (I pushed it on the todo stack to investigate), but they're like 10X slower than any other dir. (could just be the larger number of files in there; but I mean it's slow *per file*)

I never did this before because I didn't expect multi-threaded dir enumeration to be a big win; I thought it would just cause seek thrashing, and if you're IO bound anyway then multi-threading can't help, can it? Well, it turns out the Win32 dir enum functions have quite a lot of CPU overhead, so multi-threading does in fact help a bit :

nworkers| elapsed time
1       | 12.327
2       | 10.450
3       | 9.710
4       | 9.130

(* = actually the big speed win was not multi-threading, it's that the old tabdir did something rather dumb in the file enum. It would enum all files, and then do GetInfo() on each one to get the file sizes. The new one just uses the file infos that are returned as part of the Win32 enumeration, which is massively faster).

04-04-13 - Worker Thread Forward Permit Delay-Kicks

I've had this small annoying issue for a while, and finally thought of a pretty simple solution.

You may recall, I use a worker thread system with forward "permits" (reversed dependencies) . When any handle completes it sees if that completion should trigger any followup handles, and if so those are then launched. Handles may be SPU jobs or IOs or CPU jobs or whatever. The problem I will talk about occurred when the predessor and the followup were both CPU jobs.

I'll talk about a specific case to be concrete : decoding compressed data while reading it from disk.

To decode each chunk of LZ data, a chunk-decompress job is made. That job depends on the IO(s) that read in the compressed data for that chunk. It also depends on the previous chunk if the chunk is not a seek-reset point. So in the case of a non-reset chunk, you have a dependency on an IO and a previous CPU job. Your job will be started by one or the other, whichever finishes last.

Now, when decompression was IO bound, then the IO completions were kicking off the decompress jobs, and everything was fine.

In these timelines, the second line is IO and the bottom four are workers. (click images for higher res)

LZ Read and Decompress without seek-resets, IO bound :

You can see the funny fans of lines that show the dependency on the previous decompress job and also the IO. Yellow is a thread that's sleeping.

You may notice that the worker threads are cycling around. That's not really ideal, but it's not related to the problem I'm talking about today. (that cycling is caused by the fact that the OS semaphore is FIFO. For something like worker threads, we'd actually rather have a LIFO semaphore, because it makes it more likely that you get a thread with something useful still hot in cache. Someday I'll replace my OS semaphore with my own LIFO one, but for now this is a minor performance bug). (Win32 docs say that they don't gaurantee any particular order, but in my experience threads of equal priority are always FIFO in Win32 semaphores)

Okay, now for the problem. When the IO was going fast, so we were CPU bound, it's the prior decompress job that triggers the followup work.

But something bad happened due to the forward permit system. The control flow was something like this :


On worker thread 0

wake from semaphore
do on an LZ decompress job
mark job done
completion change causes a permits check
permits check sees that there is a pending job triggered by this completion
  -> fire off that handle
   handle is pushed to worker thread system
   no worker is available to do it, so wake a new worker and give him the job
finalize (usually delete) job I just finished
look for more work to do
   there is none because it was already handed to a new worker

And it looked like this :

LZ Read and Decompress without seek-resets, CPU bound, naive permits :

You can see each subsequent decompress job is moving to another worker thread. Yuck, bad.

So the fix in Oodle is to use the "delay-kick" mechanism, which I'd already been using for coroutine refires (which had a similar problem; the problem occurred when you yielded a coroutine on something like an IO, and the IO was done almost immediately; the coroutine would get moved to another worker thread instead of just staying on the same one and continuing from the yield as if it wasn't there).

The scheme is something like this :


On each worker thread :

Try to pop a work item from the "delay kick queue"
  if there is more than one item in the DKQ,
    take one for myself and "kick" the remainder
    (kick means wake worker threads to do the jobs)

If nothing on DKQ, pop from the main queue
  if nothing on main queue, wait on work semaphore

Do your job

Set "delay kick" = true
  ("delay kick" has to be in TLS of course)
Mark job as done
Permit system checks for successor handles that can now run 
  if they exist, they are put in the DKQ instead of immediately firing
Set "delay kick" = false

Repeat

In brief : work that is made runnable by the completion of work is not fired until the worker thread that did the completion gets its own shot at grabbing that new work. If the completion made 4 jobs runnable, the worker will grab 1 for itself and kick the other 3. The kick is no longer in the completion phase, it's in the pop phase.

And the result is :

LZ Read and Decompress without seek-resets, CPU bound, delay-kick permits :

Molto superiore.

These last two timelines are on the same time scale, so you can see just from the visual that eliminating the unnecessary thread switching is about a 10% speedup.

Anyway, this particular issue may not apply to your worker thread system, or you may have other solutions. I think the main take-away is that while worker thread systems seem very simple to write at first, there's actually a huge amount of careful fiddling required to make them really run well. You have to be constantly vigilant about doing test runs and checking threadprofile views like this to ensure that what's actually happening matches what you think is happening. Err, oops, I think I just accidentally wrote an advertisement for Telemetry .

04-04-13 - Oodle Compression on BC1 and WAV

I put some stupid filters in, so here are some results for the record and my reference.

BC1 (DXT1/S3TC) DDS textures :

All compressors run in max-compress mode. Note that it's not entirely fair because Oodle has the BC1 swizzle and the others don't.

Some day I'd like to do a BC1-specific encoder. Various ideas and possibilities there. Also RD-DXTC.

I also did a WAV filter. This one is particularly ridiculous because nobody uses WAV, and if you want to compress audio you should use a domain-specific compressor, not just OodleLZ with a simple delta filter. I did it because I was annoyed that RAR beat me on WAVs (due to its having a multimedia filter), and RAR should never beat me.

WAV compression :

See also : same chart with 7z (not really fair cuz 7z doesn't have a WAV filter)

Happy to see that Oodle-filt handily beats RAR-filt. I'm using just a trivial linear gradient predictor :


out[i] = in[i] - 2*in[i-1] + in[i-2]

this could surely be better, but whatever, WAV filtering is not important.

I also did a simple BMP delta filter and EXE (BCJ/relative-call) transform. I don't really want to get into the business of offering all kinds of special case filters the way some of the more insane modern archivers do (like undoing ZLIB compression so you can recompress it, or WRT), but anyhoo there's a few.


ADDED : I will say something perhaps useful about the WAV filter.

There's a bit of a funny issue because the WAV data is 16 bit (or 24 or 32), and the back-end entropy coder in a simple LZ is 8 bit.

If you just take a 16-bit delta and put it into bytes, then most of your values will be around zero, and you'll make a stream like :

[00 00] [00 01] [FF FF] [FF F8] [00 04] ...
The bad thing you should notice here are the high bytes are switching between 00 and FF even though the values have quite a small range. (Note that the common thing of centering the values with +32768 doesn't change this at all).

You can make this much better just by doing a bias of +128. That makes it so the most important range of values (around zero (specifically [-128,127])) all have the same top byte.

I think it might be even slightly better to do a "folded" signed->unsigned map, like

{ 0,-1,1,-2,2,-3,...,32767,-32768 }
The main difference being that values like -129 and +128 get the same high byte in this mapping, rather than two different high bytes in the simple +128 bias scheme.

Of course you really want a separate 8-bit huffman for alternating pairs of bytes. One way to get that is to use a few bottom bits of position as part of the literal context. Also, the high byte should really be used as context for the low byte. But both of those are beyond the capabilities of my simple LZ-huffs so I just deinterleave the high and low bytes to two streams.

4/02/2013

04-02-13 - The Method of Random Holdouts

Quick and dirty game-programming style hacky way to make fitting model parameters somewhat better.

You have some testset {T} of many items, and you wish to fit some heuristic model M over T which has some parameters. There may be multiple forms of the model and you aren't sure which is best, so you wish to compare models against each other.

For concreteness, you might imagine that T is a bunch of images, and you are trying to make a perceptual DXTC coder; you measure block error in the encoder as something like (SSD + a * SATD ^ b + c * SSIM_8x8 ) , and the goal is to minimize the total image error in the outer loop, measured using something complex like IW-MS-SSIM or "MyDCTDelta" or whatever. So you are trying to fit the parameters {a,b,c} to minimize an error.

For reference, the naive training method is : run the model on all data in {T}, optimize parameters to minimize error over {T}.

The method of random holdouts goes like this :


Run many trials

On each trial, take the testset T and randomly separate it into a training set and a verification set.
Typically training set is something like 75% of the data and verification is 25%.

Optimize the model parameters on the {training set} to minimize the error measure over {training set}.

Now run the optimized model on the {verification set} and measure the error there.
This is the error that will be used to rate the model.

When you make the average error, compensate for the size of the model thusly :
average_error = sum_error / ( [num in {verification set}] - [dof in model] )

Record the optimal parameters and the error for that trial

Now you have optimized parameters for each trial, and an error for each trial. You can take the average over all trials, but you can also take the sdev. The sdev tells you how well your model is really working - if it's not close to zero then you are missing something important in your model. A term with a large sdev might just be a random factor that's not useful in the model, and you should try again without it.

The method of random holdouts reduces over-training risk, because in each run you are measuring error only on data samples that were not used in training.

The method of random holdouts gives you a decent way to compare models which may have different numbers of DOF. If you just use the naive method of training, then models with more DOF will always appear better, because they are just fitting your data set.

That is, in our example say that (SSD + a * SATD ^ b) is actually the ideal model and the extra term ( + c * SSIM_8x8 ) is not useful. As long as it's not just a linear combo of other terms, then naive training will find a "c" such that that term is used to compensate for variations in your particular testset. And in fact that incorrect "c" can be quite a large value (along with a negative "a").

This kind of method can also be used for fancier stuff like building complex models from ensembles of simple models, "boosting" models, etc. But it's useful even in this case where we wind up just using a simple linear model, because you can see how it varies over the random holdouts.

3/31/2013

03-31-13 - Market Equilibrium

I'm sure there's some standard economic theory of all this but hey let's reinvent the wheel without any background.

There's a fundamental principal of any healthy (*) market that the reward for some labor is equal across all fields - proportional only to standard factors like the risk factor, the scarcity of labor, the capital required for entry, etc. (* = more on "healthy" later). The point is that those factors have *nothing* to do with the details of the field.

The basic factor at play is that if some field changes and suddenly becomes much more profitable, then people will flood into that field, and the risk-equal-capital-return will keep going down until it becomes equal to other fields. Water flows downhill, you know.

When people like Alan Greenspan try to tell you that oh this new field is completely unlike anything we've seen in the past because of blah blah - it doesn't matter, they may have lots of great points that seem reasonable in isolation, but the equilibrium still applies. The pay of a computer programmer is set by the pay of a farmer, because if the difference were out of whack, the farmer would quit farming and start programming; the pay of programmers will go down and the wages of farmers will go up, then the price of lettuce will go up, and in the end a programmer won't be able to buy any more lettuce than anyone else in a similar job. ("similar" only in terms of risk, ease of entry, rarity of talent, etc.)

We went through a drive-through car wash yesterday and Tasha idly wondered how much the car wash operator makes from an operation like that. Well, I bet it's about the same as a quick-lube place makes, and that's about the same as a dry cleaner, and it's about the same as a pizza place (which has less capital outlay but more risk), because if one of them was much more profitable, there would be more competition until equilibrium was reached.

Specifically I've been thinking about this because of the current indie game boom on the PC, which seems to be a bit of a magic gold rush at the moment. That almost inevitably has to die out, it's just a question of when. (so hurry up and get your game out before it does!).

But of course that leads us into the issue of broken markets, since all current game sales avenues are deeply broken markets.

Equilibrium (like most naive economic theory) only applies to markets where there's fluidity, robust competition, no monopolistic control, free information, etc. And of course those don't happen in the real world.

Whenever a market is not healthy, it provides an opportunity for unbalanced reward, well out of equilibrium.

Lack of information can be particularly be a factor in small niches. There can be a company that does something random like make height-adjustable massage tables. If they're a private operation and nobody really pays attention to them, they can have super high profit levels for something that's not particularly difficult - way out of equilibrium. If other people knew how easy that business was, lots of others would enter, but due to lack of information they don't.

Patents and other such mechanisms that create legally enforced distortions of the market. Of course things like the cable and utility systems are even worse.

On a large scale, government distortion means that huge fields like health care, finance, insurance, oil, farming, etc. are all more profitable than they should be.

Perhaps the biggest issue in downloadable games is the oligopoly of App Store and Steam. This creates an unhealthy market distortion and it's hard to say exactly what the long term affect of that will be. (of course you don't see it as "unhealthy" if you are the one benefiting from the favor of the great powers; it's unhealthy in a market fluidity and fair competition sense, and may slow or prevent equilibrium)

Of course new fields are not yet in equilibrium, and one of the best ways to "get rich quick" is to chase new fields. Software has been out of equilibrium for the past 50 years, and is only recently settling down. Eventually software will be a very poorly paid field, because it requires very little capital to become a programmer, it's low risk, and there are lots of people who can do it.

Note that in *every* field the best will always rise to the top and be paid accordingly.

Games used to be a great field to work in because it was a new field. New fields are exciting, they offer great opportunities for innovation, and they attract the best people. Mature industries are well into equilibrium and the only chances for great success are through either big risk, big capital investment, or crookedness.

03-31-13 - Index - Game Threading Architecture

Gathering the series for an index post :

cbloom rants 08-01-11 - A game threading model
cbloom rants 12-03-11 - Worker Thread system with reverse dependencies
cbloom rants 03-05-12 - Oodle Handle Table
cbloom rants 03-08-12 - Oodle Coroutines
cbloom rants 06-21-12 - Two Alternative Oodles
cbloom rants 07-19-12 - Experimental Futures in Oodle
cbloom rants 10-26-12 - Oodle Rewrite Thoughts
cbloom rants 12-18-12 - Async-Await ; Microsoft's Coroutines
cbloom rants 12-21-12 - Coroutine-centric Architecture
cbloom rants 12-21-12 - Coroutines From Lambdas
cbloom rants 12-06-12 - Theoretical Oodle Rewrite Continued
cbloom rants 02-23-13 - Threading - Reasoning Behind Coroutine Centric Design

I believe this is a good architecture, using the techniques that we currently have available, without doing anything that I consider bananas like writing your own programming language (*). Of course if you are platform-specific or know you can use C++11 there are small ways to make things more convenient, but the fundamental architecture would be about the same (and assuming that you will never need to port to a broken platform is a mistake I know well).

(* = a lot of people that I consider usually smart seem to think that writing a custom language is a great solution for lots of problems. Whenever we're talking about "oh reflection in C is a mess" or "dependency analysis should be automatic", they'll throw out "well if you had the time you would just write a custom language that does all this better". Would you? I certainly wouldn't. I like using tools that actually work, that new hires are familiar with, etc. etc. I don't have to list the pros of sticking with standard languages. In my experience every clever custom language for games is a huge fucking disaster and I would never advocate that as a good solution for any problem. It's not a question of limitted dev times and budgets.)

03-31-13 - Endian-Independent Structs

I dunno, maybe this is common practice, but I've never seen it before.

The easy way to load many file formats (I'll use a BMP here to be concrete) is just to point a struct at it :


struct BITMAPFILEHEADER
{
    U16 bfType; 
    U32 bfSize; 
    U16 bfReserved1; 
    U16 bfReserved2; 
    U32 bfOffBits; 
} __attribute__ ((__packed__));


BITMAPFILEHEADER * bmfh = (BITMAPFILEHEADER *)data;

if ( bmfh->bfType != 0x4D42 )
    ERROR_RETURN("not a BM",0);

etc..

but of course this doesn't work cross platform.

So people do all kinds of convoluted things (which I have usually done), like changing to a method like :


U16 bfType = Get16LE(&ptr);
U32 bfSize = Get32LE(&ptr);

or they'll do some crazy struct-parse fixup thing which I've always found to be bananas.

But there's a super trivial and convenient solution :


struct BITMAPFILEHEADER
{
    U16LE bfType; 
    U32LE bfSize; 
    U16LE bfReserved1; 
    U16LE bfReserved2; 
    U32LE bfOffBits; 
} __attribute__ ((__packed__));

where U16LE is just U16 on little-endian platforms and is a class that does bswap on itself on big-endian platforms.

Then you can still just use the old struct-pointing method and everything just works. Duh, I can't believe I didn't think of this earlier.

Similarly, here's a WAV header :


struct WAV_header_LE
{
    U32LE FOURCC_RIFF; // RIFF Header 
    U32LE riffChunkSize; // RIFF Chunk Size 
    U32LE FOURCC_WAVE; // WAVE Header 
    U32LE FOURCC_FMT; // FMT header 
    U32LE fmtChunkSize; // Size of the fmt chunk 
    U16LE audioFormat; // Audio format 1=PCM,6=mulaw,7=alaw, 257=IBM Mu-Law, 258=IBM A-Law, 259=ADPCM 
    U16LE numChan; // Number of channels 1=Mono 2=Sterio 
    U32LE samplesPerSec; // Sampling Frequency in Hz 
    U32LE bytesPerSec; // bytes per second 
    U16LE blockAlign; // normall NumChan* bytes per sample
    U16LE bitsPerSample; // Number of bits per sample 
}  __attribute__ ((__packed__));;

easy.

For file-input type structs, you just do this and there's no penalty. For structs you keep in memory you wouldn't want to eat the bswap all the time, but even in that case this provides a simple way to get the swizzle into native structs by just copying all the members over.

Of course if you have the Reflection-Visitor system that I'm fond of, that's also a good way to go. (cursed C, give me a "do this macro on all members").

3/30/2013

03-30-13 - Error codes

Some random rambling on the topic of returning error codes.

Recently I've been fixing up a bunch of code that does things like

void  MutexLock( Mutex * m )
{
    if ( ! m ) return;
    ...

yikes. Invalid argument and you just silently do nothing. No thank you.

We should all know that silently nopping in failure cases is pretty horrible. But I'm also dealing with a lot of error code returns, and it occurs to me that returning an error code in that situation is not much better.

Personally I want unexpected or unhandleable errors to just blow up my app. In my own code I would just assert; unfortunately that's not viable in OS code or perhaps even in a library.

The classic example is malloc. I hate mallocs that return null. If I run out of memory, there's no way I'm handling it cleanly and reducing my footprint and carrying on. Just blow up my app. Personally whenever I implement an allocator if it can't get memory from the OS it just prints a message and exits (*).

(* = aside : even better is "functions that don't fail" which I might write more about later; basically the idea is the function tries to handle the failure case itself and never returns it out to the larger app. So in the case of malloc it might print a message like "tried to alloc N bytes; (a)bort/(r)etry/return (n)ull?". Another common case is when you try to open a file for write and it fails for whatever reason, it should just handle that at the low level and say "couldn't open X for write; (a)bort/(r)etry/change (n)ame?" )

I think error code returns are okay for *expected* and *recoverable* errors.

On functions that you realistically expect to always succeed and will not check error codes for, they shouldn't return error codes at all. I wrote recently about wrapping system APIs for portable code ; an example of the style of level 2 wrapping that I like is to "fix" the error returns.

(obviously this is not something the OS should do, they just have to return every error; it requires app-specific knowledge about what kind of errors your app can encounter and successfully recover from and continue, vs. ones that just mean you have a catastrophic unexpected bug)

For example, functions like lock & unlock a mutex shouldn't fail (in my code). 99% of the user code in the world that locks and unlocks mutexes doesn't check the return value, they just call lock and then proceed assuming the lock succeeded - so don't return it :


void mypthread_mutex_lock(mypthread_mutex_t *mutex)
{
    int ret = pthread_mutex_lock(mutex);
    if ( ret != 0 )
        CB_FAIL("pthread_mutex_lock",ret);
}

When you get a crazy unexpected error like that, the app should just blow up right at the call site (rather than silently failing and then blowing up somewhere weird later on because the mutex wasn't actually locked).

In other cases there are a mix of expected failures and unexpected ones, and the level-2 wrapper should differentiate between them :


bool mysem_trywait(mysem * sem)
{
    for(;;)
    {
        int res = sem_trywait( sem );
        if ( res == 0 ) return true; // got it

        int err = errno;
        if ( err == EINTR )
        {
            // UNIX is such balls
            continue;
        }
        else if ( err == EAGAIN )
        {
            // expected failure, no count in sem to dec :
            return false;
        }
        else
        {
            // crazy failure; blow up :
            CB_FAIL("sem_trywait",err);
        }
    }
}

(BTW best practice these days is always to copy "errno" out to an int, because errno may actually be #defined to a function call in the multithreaded world)

And since I just stumbled into it by accident, I may as well talk about EINTR. Now I understand that there may be legitimate reasons why you *want* an OS API that's interrupted by signals - we're going to ignore that, because that's not what the EINTR debate is about. So for purposes of discussion pretend that you never have a use case where you want EINTR and it's just a question of whether the API should put that trouble on the user or not.

I ranted about EINTR at RAD a while ago and was informed (reminded) this was an ancient argument that I was on the wrong side of.

Mmm. One thing certainly is true : if you want to write an operating system (or any piece of software) such that it is easy to port to lots of platforms and maintain for a long time, then it should be absolutely as simple as possible (meaning simple to implement, not simple in the API or simple to use), even at the cost of "rightness" and pain to the user. That I certainly agree with; UNIX has succeeded at being easy to port (and also succeeded at being a pain to the user).

But most people who argue on the pro-EINTR side of the argument are just wrong; they are confused about what the advantage of the pro-EINTR argument is (for example Jeff Atwood takes off on a general rant against complexity ; I think we all should know by now that huge complex APIs are bad; that's not interesting, and that's not what "Worse is Better" is about; or Jeff's example of INI files vs the registry - INI files are just massively better in every way, it's not related at all, there's no pro-con there).

(to be clear and simple : the pro-EINTR argument is entirely about simplicity of implementation and porting of the API; it's about requiring the minimum from the system)

The EINTR-returning API is not simpler (than one that doesn't force you to loop). Consider an API like this :


U64 system( U64 code );

doc :

if the top 32 bits of code are 77 this is a file open and the bottom 32 bits specify a device; the
return values then are 0 = call the same function again with the first 8 chars of the file name ...
if it returns 7 then you must sleep at least 1 milli and then call again with code = 44 ...
etc.. docs for 100 pages ...

what you should now realize is that *the docs are part of the API*. (that is not a "simple" API)

An API that requires you to carefully read about the weird special cases and understand what is going on inside the system is NOT a simple API. It might look simple, but it's in disguise. A simple API does what you expect it to. You should be able to just look at the function signature and guess what it does and be right 99% of the time.

Aside from the issue of simplicity, any API that requires you to write the exact same boiler-plate every time you use it is just a broken fucking API.

Also, I strongly believe that any API which returns error codes should be usable if you don't check the error code at all. Yeah yeah in real production code of course you check the error code, but for little test apps you should be able to do :


int fd = open("blah");

read(fd,buf);

close(fd);

and that should work okay in my hack test app. Nope, not in UNIX it doesn't. Thanks to its wonderful "simplicity" you have to call "read" in a loop because it might decide to return before the whole read is done.

Another example that occurs to me is the reuse of keywords and syntax in C. Things like making "static" mean something completely different depending on how you use it makes the number of special keywords smaller. But I believe it actually makes the "API" of the language much *more* complex. Instead of having intuitive and obvious separate clear keywords for each meaning that you could perhaps figure out just by looking at them, you instead have to read a bunch of docs and have very technical knowledge of the internals of what the keywords mean in each usage. (there are legitimate advantages to minimizing the number of keywords, of course, like leaving as many names available to users as possible). Knowledge required to use an API is part of the API. Simplicity is determined by the amount of knowledge required to do things correctly.

3/26/2013

03-26-13 - Simulating Deep Yield with a Wait

I'm becoming increasingly annoyed at my lack of "deep yield" for coroutines.

Any time you are in a work item, if you decide that you can get some more parallelism by doing a branch-merge inside that item, you need deep yield.

Remember you should never ever do an OS wait on a coroutine thread (with normal threads anyway; on a WinRT threadpool thread you can). The reason is the OS wait disables that worker thread, so you have one less. In the worst case, it leads to deadlock, because all your worker threads can be asleep waiting on work items, and with no worker threads they will never get done.

Anyway, I've cooked up a temporary work-around, it looks like this :


I'm in some function and I want to branch-merge

If I'm not on on a worker thread
  -> just do a normal branch-merge, send the work off and use a Wait for completion

If I am on a worker thread :

inc target worker thread count
if # currently live worker threads is < target count
  start a new worker thread (either create or wake from pool)

now do the branch-merge and use OS Wait
dec the target worker thread count


on each worker thread, after completing a work item and before popping more work :
if target worker thread count < currently live count
  stop self (go back into a sleeping state in the pool)

this is basically using OS threads to implement stack-saving deep yield. It's not awesome, but it is okay if deep yield is rare.

03-26-13 - Oodle 1.1 and GDC

Hey it's GDC time again, so if you're here come on by the RAD booth and say "hi" (or "fuck you", or whatever).

The Oodle web site just went live a few days ago.

Sometimes I feel embarassed (ashamed? humiliated?) that it's taken me five years to write a file IO and data compression library. Other times I think I've basically written an entire OS by myself (and all the docs, and marketing materials, and a video compressor, and aborted paging engine, and a bunch of other crap) and that doesn't sound so bad. I suppose the truth is somewhere in the middle. (perhaps with Oodle finally being officially released and selling, I might write a little post-mortem about how it's gone, try to honestly look back at it a bit. (because lord knows what I need is more introspection in my life)).

Oodle 1.1 will be out any day now. Main new features :


Lots more platforms.  Almost everything except mobile platforms now.

LZNIB!  I think LZNIB is pretty great.  8X faster to decode than ZLIB and usually
makes smaller files.

Other junk :
All the compressors can run parallel encode & decode now.
Long-range-matcher for LZ matching on huge files (still only in-memory though).
Incremental compressors for online transmission, and faster resets.

Personally I'm excited the core architecture is finally settling down, and we have a more focused direction to go forward, which is mainly the compressors. I hope to be able to work on some new compressors for 1.2 (like a very-high-compression option, which I currently don't have), and then eventually move on to some image compression stuff.

3/19/2013

03-19-13 - Windows Sleep Variation

Hmm, I've discovered that Sleep(n) behaves very differently on my three Windows boxes.

(Also remember there are a lot of other issues with Sleep(n) ; the times are only reliable here because this is in a no-op test app)

This actually started because I was looking into Linux thread sleep timing, so I wrote a little test to just Sleep(n) a bunch of times and measure the observed duration of the sleep.

(Of course on Windows I do timeBeginPeriod(1) and bump my thread to very high priority (and timeGetDevCaps says the minp is 1)).

Anyway, what I'm seeing is this :


Win7 :
sleep(1) : average = 0.999 , sdev = 0.035 ,min = 0.175 , max = 1.568
sleep(2) : average = 2.000 , sdev = 0.041 ,min = 1.344 , max = 2.660
sleep(3) : average = 3.000 , sdev = 0.040 ,min = 2.200 , max = 3.774

Sleep(n) averages n
duration in [n-1,n+1]

WinXP :
sleep(1) : average = 1.952 , sdev = 0.001 ,min = 1.902 , max = 1.966
sleep(2) : average = 2.929 , sdev = 0.004 ,min = 2.665 , max = 2.961
sleep(3) : average = 3.905 , sdev = 0.004 ,min = 3.640 , max = 3.927

Sleep(n) averages (n+1)
duration very close to (n+1) every time (tiny sdev)

Win8 :
sleep(1) : average = 2.002 , sdev = 0.111 ,min = 1.015 , max = 2.101
sleep(2) : average = 2.703 , sdev = 0.439 ,min = 2.017 , max = 3.085
sleep(3) : average = 3.630 , sdev = 0.452 ,min = 3.003 , max = 4.130

average no good
Sleep(n) minimum very precisely n
duration in [n,n+1] (+ a little error)
rather larger sdev

it's like completely different logic on each of my 3 machines. XP is the most precise, but it's sleeping for (n+1) millis instead of (n) ! Win8 has a very precise min of n, but the average and max is quite sloppy (sdev of almost half a milli, very high variation even with nothing happening on the system). Win7 hits the average really nicely but has a large range, and is the only one that will go well below the requested duration.

As noted before, I had a look at this because I'm running Linux in a VM and seeing very poor performance from my threading code under Linux-VM. So I ran this experiment :


Sleep(1) on Linux :

native : average = 1.094 , sdev = 0.015 , min = 1.054 , max = 1.224
in VM  : average = 3.270 , sdev =14.748 , min = 1.058 , max = 656.297

(added)
in VM2 : average = 1.308 , sdev = 2.757 , min = 1.052 , max = 154.025

obviously being inside a VM on Windows is not being very kind to Linux's threading system. On the native box, Linux's sleep time is way more reliable than Windows (small min-max range) (and this is just with default priority threads and SCHED_OTHER, not even using a high priority trick like with the Windows tests above).

added "in VM2". So the VM threading seems to be much better if you let it see many fewer cores than you have. I'm running on a 4 core (8 hypercore) machine; the base "in VM" numbers are with the VM set to see 4 cores. "in VM2" is with the VM set to 2 cores. Still a really bad max in there, but much better overall.

3/16/2013

03-16-13 - Writing Portable Code Rambles

Some thoughts after spending some time on this (still a newbie). How I would do it differently if I started from scratch.

1. Obviously you all know the best practice of using your own data types (S32 or whatever) and making macros for any kind of common operation that the standards don't handle well (like use a SEL macro instead of ?: , make a macro for ROT, etc). Never use bit-fields, make your own macros for manipulating bits within words. You also have to make your own whole macro meta-language for things not quite in the language, like data alignment, restrict/alias, etc. etc. (god damn C standard people, spend some time on the actual problems that real coders face every day. Thanks mkay). That's background and it's the way to go.

Make your own defines for SIZEOF_POINTER since stupid C doesn't give you any way to check sizeof() in a macro. You probably also want SIZEOF_REGISTER. You need your own equivalent of ptrdiff_t and intptr_t. Best practice is to use pointer-sized ints for all indexing of arrays and buffer sizes.

(one annoying complication is that there are platforms with 64 bit pointers on which 64-bit int math is very slow; for example they might not have a 64-bit multiply at all and have to emulate it. In that case you will want to use 32-bit ints for array access when possible; bleh)

Avoid using "wchar_t" because it is not always the same size. Try to explicitly use UTF16 or UTF32 in your code. You could make your own SIZEOF_WCHAR and select one or the other on the appropriate platform. (really try to avoid using wchar at all; just use U16 or U32 and do your own UTF encoding).

One thing I would add to the macro meta-language next time is to wrap every single function (and class) in my code. That is, instead of :


int myfunc( int args );

do

FUNC1 int FUNC2 myfunc(int args );

or even better :

FUNC( int , myfunc , (int args) );

this gives you lots of power to add attributes and other munging as may be needed later on some platforms. If I was doing this again I would use the last style, and I would have two of them, a FUNC_PUBLIC and FUNC_PRIVATE to control linkage. Probably should have separate wrapper macros for the proto and the body.

While you're at it you may as well have a preamble in every func too :


FUNC_PUBLIC_BODY( int , myfunc , (int args) )
{
    FUNC_PUBLIC_PRE

    ...
}

which lets you add automatic func tracing, profiling, logging, and so on.

I wish I had made several different layers of platform Id #defines. The first one you want is the lowest level, which explicitly Id's the current platform. These should be exclusive (no overlaps), something like OODLE_PLATFORM_X86X64_WIN32 or OODLE_PLATFORM_PS3_PPU.

Then I'd like another layer that's platform *groups*. For me the groups would probably be OODLE_PLATFORM_GROUP_PC , GROUP_CONSOLE, and GROUP_EMBEDDED. Those let you make gross characterizations like on "GROUP_PC" you use more memory and have more debug systems and such. With these mutually exclusive platform checks, you should never use an #else. That is, don't do :

#if OODLE_PLATFORM_X86X64_WIN32
.. some code ..
#else
.. fallback ..
#endif
it's much better to explicitly enumerate which platforms you want to go to which code block, and then have an
#else
#error new platform
#endif
at the end of every check. That way when you try building on new platforms that you haven't thought carefully about yet, you get nice compiler notification about all the places where you need to think "should it use this code path or should I write a new one". Fallbacks are evil! I hate fallbacks, give me errors.

Aside from the explicit platforms and groups I would have platform flags or caps which are non-mutually exclusive. Things like PLATFORM_FLAG_STDIN_CONSOLE.

While you want the raw platform checks, in end code I wish I had avoided using them explicitly, and instead converted them into logical queries about the platform. What I mean is, when you just have an "#if some platform" in the code, it doesn't make it clear why you care that's the platform, and it doesn't make it reusable. For example I have things like :

#if PLATFORM_X86X64
// .. do string matching by U64 and xor/cntlz
#else
// unaligned U64 read may be slow
// do string match byte by byte
#endif
what I should have done is to introduce an abstraction layer in the #if that makes it clear what I am checking for, like :

#if PLATFORM_X86X64
#define PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS 1
#elif PLATFORM_PS3
#define PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS 0
#else
#error classify me
#endif

#if PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS
// .. do string matching by U64 and xor/cntlz
#else
// unaligned U64 read may be slow
// do string match byte by byte
#endif

then it's really clear what you want to know and how to classify new platforms. It also lets you reuse that toggle in lots of places without code duping the fiddly bit, which is the platform classification.

Note that when doing this, it's best to make high level usage-specific switches. You might be tempted to try to use platform attributes there. Like instead of "PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS" you might want to use "PLATFORM_SWITCH_UNALIGNED_READ_PENALTY" . But that's not actually what you want to know, you want to know if on my particular application (LZ string match) it's better to use big words or not, and that might not match the low level attribute of the CPU.

It's really tempting to skip all this and abuse the switches you can see (lord knows I do it); I see (and write) lots of code that does evil things like using "#ifdef _MSC_VER" to mean something totally different like "is this x86 or x64" ? Of course that screws you when you move to another x86 platform and you aren't detecting it correctly (or when you use MSVC to make PPC or ARM compiles).

Okay, that's all pretty standard, now for the new bit :

2. I would opaque out the system APIs in two levels. I haven't actually ever done this, so grains of salt, but I'm pretty convinced it's the right way to go after working with a more standard system.

(for the record : the standard way is to make a set of wrappers that tries to behave the same on all systems, eg. that tries to hide what system you are on as much as possible. Then if you need to do platform-specific stuff you would just include the platform system headers and talk to them directly. That's what I'm saying is not good.)

In the proposed alternative, the first level would just be a wrapper on the system APIs with minimal or no behavior change. That is, it's just passing them through and standardizing naming and behavior.

At this level you are doing a few things :

2.A. Hiding the system includes from the rest of your app. System includes are often in different places, and often turn on compiler flags in nasty ways. You want to remove that variation from the rest of your code so that your main codebase only sees your own wrapper header.

2.B. Standardizing naming. For example the MSVC POSIX funcs are all named wrong; at this level you can patch that all up.

2.C. Fixing things that are slightly different or don't work on various platforms where they really should be the same. For example things like pthreads are not actually all the same on all the pthreads platforms, and that can catch you out in nasty ways. (eg. things like sem_init always failing on Mac).

Note this is *not* trying to make non-POSIX platforms look like POSIX. It's not hiding the system you're on, just wrapping it in a standard way.

2.D. I would also go ahead and add my own asserts for args and returns in this layer, because I hate functions that just return error codes when there's a catastrophic failure like a null arg or an EHEAPCORRUPT or whatever.

So once you have this wrapper you no longer call any system funcs directly from your main codebase, but you still would be doing things like :


#if PLATFORM_WIN32

    HANDLE h = platform_CreateFile( ... )

#elif PLATFORM_POSIX

    int fd = platform_open( name , flags )

#else
    #error unknown platform
#endif

that is, you're not hiding what platform you're on, you're still letting the larger codebase get to the low level calls, it's just the mess of how fucked they are that's hidden a bit.

3. You then have a second level of wrapping which tries to make same-action interfaces that dont require you to know what platform you're on. Second level is written on the first level.

The second level wrappers should be as high level as necessary to opaque out the operation. For example rather than having "make temp file name" and "open file" you might have "open file with temp name", because on some platforms that can be more efficient when you know it is a high-level combined op. You don't just have "GetTime" you have "GetTimeMonotonic" , because on some platforms they have an efficient monotonic clock for you, and on other platforms/hardwares you may have to do a lot of complicated work to ensure a reliable clock (that you don't want to do in the low level timer).

When a platform can't provide a high-level function efficiently, rather than emulate it in a complex way I'd rather just not have it - not a stub that fails, but no definition at all. That way I get a compile error and in those spots I can do something different, using the level 1 APIs.

The first level wrappers are very independent of the large code base's usage, but the second level wrappers are very much specifically designed for their usage.

To be clear about the problem of making platform-hiding second layer wrappers, consider something like OpenFile(). What are the args to that? What can it do? It's hopeless to make something that works on all platforms without greatly reducing the capabilities of some platforms. And the meaning of various options (like async, temporary, buffered, etc.) all changes with platform.

If you wanted to really make a general purpose multi-platform OpenFile you would have to use some kind of "caps" query system, where you first do something like OpenFile_QueryCaps( OF_DOES_UNBUFFERED_MEAN_ALIGNMENT_IS_REQUIRED ) and it would be an ugly disaster. (and it's obviously wrong on the face of it, because really what you're doing there is saying "is this win32" ?). The alternative to the crazy caps system is to just make the high level wrappers very limited and specific to your usage. So you could make a platform-agnostic wrapper like OpenFile_ForReadShared_StandardFlagsAndPermissions(). Then the platforms can all do slightly different things and satisfy the high level goal of the imperative in the best way for that platform.

A good second level has as few functions as possible, and they are as high level as possible. Making them very high level allows you to do different compound ops on the platform in a way that's hidden from the larger codebase.

3/10/2013

03-10-13 - Two LZ Notes

Note 1 : on rep matches.

"Rep matches" are a little weird. They help a lot, but the reason why they help depends on the file you are compressing. (rep match = repeat match, gap match, aka "last offset")

On text files, they work as interrupted matches, or "gap matches". They let you generate something like :


stand on the floor
stand in the door

stand in the door
[stand ][i][n the ][d][oor]

[off 19, len 6][1 lit][rep len 6][1 lit][off 18, len 3]

that is, you have a long match of [stand on the ] but with a gap at the 'o'.

Now, something I observed was that more than one last offset continues to help. On text the main benefit from having two last offsets is that it lets you use a match for the gap. When the gap is not just one character but a word, you might want to use a match to put that word in, in which case the continuation after the gap is no longer the first last-offset, it's the second one. eg.


cope
how to work with animals
how to cope with animals

[how to ][cope][ with animals]
[off 25 ][off 32][off 25 (rep2)]

You could imagine alternative coding structures that don't require keeping some number of "last offsets". (oddly, the last offset maintenance can be a large part of decode time, because maintaining an MTF list is something that CPUs do incredibly poorly). For example you could code with a scheme where you just send the entire long match, and then any time you send a long match you send a flag for "are there any gaps", and if so you then code some gaps inside the match.

The funny thing is, on binary files "last offsets" do something else which can be more important. They become the most common offsets. In particular, on highly structured binary data, they will generally be some factor of the structure size. eg. on a file that has a struct size of 36, and that struct has dwords and such in it, the last offsets will generally be things like 4,8,16,36, or 72. They provide a sort of dictionary of the most common offsets so that you can code those smaller. You are still getting the gap-match effect, but the common-offset benefit is much bigger on these files.

(aside : word-replacing transform on text really helps LZ (and everything) by removing the length variance of tokens. In particular for LZ77, word length variance breaks rep matches. There are lots of common occurances of a single replaced word in a phrase, like : "I want some stuff" -> "I want the stuff". You can't get a rep match here of [ stuff] because the offset changed because the substituted word was different length. If you do WRT first, then gap matches get these.)

Note 2 : on offset structure.

I've had it in the back of my head for quite some time now to do an LZ compressor specifically designed for structured data.

One idea I had was to use "2d" match offsets. That is, send a {dx,dy} where dx is within the record and dy is different records. Like imagine the data is in a table, dy is going back rows, dx is an offset on the row. You probably want to mod dx around the row so its range is always the same, and special case dy=0 (matches within your own record).

It occurred to me that the standard way of sending LZ offsets these days actually already does this. The normal way that good LZ's send offsets these days is to break it into low and high parts :

low = offset & 7F;
high = offset >> 7;
or similar, then you send "high" using some kind of "NoSB" scheme (Number of Significant Bits is entropy coded, and the bits themselves are sent raw), and you send "low" with an order-0 entropy coder.

But this is just a 2d structured record offset for a particular power-of-2 record size. It's why when I've experimented with 2d offsets I haven't seen huge wins - because I'm already doing it.

There is some win to be had from custom 2d-offsets (vs. the standard low/high bits scheme) when the record size is not a power of two.

3/06/2013

03-06-13 - Sympathy for the Library Writer

Over the years of being a coder who was a library-consumer and not a library-writer, I've done my share of griping about annoying API's or what I saw as pointless complication or ineffiency. Man, I've been humbled by my own experience trying to write a public library. It is *hard*.

The big problem with libraries is that you don't control how they're used. This is in contrast to game engines. Game engines are not libraries. I've worked on many game engines over the years, including ones that went out to large free user bases (Genesis 3d and Wild Tangent), and they are much much easier than libraries.

The difference is that game engines generally impose an architecture on the user. They force you to use it in a certain way. (this is of course why more advanced developers despise them so much; it sucks to have some 3rd party telling you your code architecture). It's totally acceptable if a game engine only works well when you use it in the approved way, and is really slow if you abuse it, or it could even crash if you use it oddly.

A library has to be flexible about how it's used; it can't impose a system on the user, like a certain threading model, or a certain memory management model, or even an error-handling style.

Personally when I do IO for games, I make a "tool path" that just uses stdio and is very simple and flexible, does streaming IO and text parsing and so on, but isn't shipped with the game, and I make a "game path" that only does large-block async IO that's pre-baked so you can just point at it. I find that system is powerful enough for my use, it's easy to write and use. It means that the "tool path" doesn't have to be particularly fast, and the fast game path doesn't need to support buffered character IO or anything other than big block reads.

But I can't force that model on clients, so I have to support all the permutations and I have to make them all decently fast.

A lot of times in the past I've complained about over-complicated APIs that have tons of crazy options that nobody ever needs (look at the IJG jpeg code for example). Well, now I see that often those complicated APIs were made because somebody (probably somebody important) needed those options. Of course as the library provider you can offer the complex interface and also simpler alternatives, but that has its own pitfalls of making the API bigger and more redundant (like if you offer OpenFileSimple and OpenFileComplex); in some ways it's better to only offer the complex API and make the user wrap it and reduce the parameter set to what they actually use.

There's also a sort of "liability" issue in libraries. Not exactly legal liability, but program bad behavior liability. Lots of things that would make the library easier to use and faster are naughty to do automatically. For example Oodle under Vista+ can run faster with elevated priviledge, to get access to some of the unsecure file APIs (like extending without zeroing), but it would be naughty for me to do that automatically, so instead I have to add an extra step to make the client specifically ask for that.

Optimization for me has really become a nightmare. At first I was trying to make every function fast, but it's impossible, there are just too many entry points and too many usage patterns. Now my philosophy is to make certain core functions fast, and then address problems in the bigger high level API as customers see issues. I remember as a game developer always being so pissed that all the GL drivers were specially optimized for Id. I would want to use the API in a slightly different style, and my way would be super slow, not for any good reason but just because it hadn't gotten the optimization loving of the important customer's use case.

I used to also rail about the "unnecessary" argument checking that all the 3d APIs do. It massively slows them down, and I would complain that I had ensured the arguments were good so just fucking pass them through, stop slowing me down with all your validation! But now I see that if you really do that, you will just constantly be crashing people as they pass in broken args. In fact arg validation is often the way that people figure out the API, either because they don't read the docs or because the docs are no good.

(this is not even getting into the issue of API design which is another area where I have been suitably humbled)

ADDENDUM : I guess I should mention the really obvious points that I didn't make.

1. One of the things that makes a public library so hard after release is that you can't refactor. The normal way I make APIs for myself (and for internal teams) is to sort of make an effort at a good API the first time, but it usually sucks, and you rip it out and go through big scourges of find-rep. That only works when you control all the code, the library and the consumer. It's only after several iterations that the API becomes really nice (and even then it's only nice for that specific use case, it might still suck in the wild).

2. APIs without users almost always suck. When someone goes away in a cave and works on a big new fancy library and then shows it to the world, it's probably terrible. This a problem that I think everyone at RAD faces. The code of mine that I really like is stuff that I use over and over, so that I see the flaws and when I want it to be easier to use I just go fix it.

3. There are two separate issues about what makes an API "good". One is "is it good for the user?" and one is "is it good for the library maintainer?". Often they are the same but not always.

Anyway, the main point of this post is supposed to be : the next time you complain about a bad library design, there may well be valid reasons why it is the way it is; they have to balance a lot of competing goals. And even if they got it wrong, hey it's hard.

3/01/2013

03-01-13 - Zopfli

zopfli seems to make small zips. There's no description of the algorithm so I can't comment on it. But hey, if you want small zips it seems to be the current leader.

(update : I've had a little look, and it seems to be pretty straightforward, it's an optimal parser + huff reset searcher. There are various other prior tools to do this (kzip,deflopt,defluff,etc). It's missing some of the things that I've written about before here, such as methods of dealing with the huff-parse feedback; the code looks pretty clean, so if you want a good zip-encoder code it looks like a good place to start.)

I've written these things before, but I will summarize here how to make small zips :

1. Use an exact (windowed) string matcher.

cbloom rants 09-24-12 - LZ String Matcher Decision Tree

2. Optimal parser. Optimal parsing zip is super easy because it has no "repeat match", so you can use plain old backwards scan. You do have the huffman code costs, so you have to consider at least one match candidate for each codeword length.

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 09-04-12 - LZ4 Optimal Parse

3. Deep huffman reset search. You can do this pretty easily by using some heuristics to set candidate points and then building a bottom-up tree. Zopfli seems to use a top-down greedy search. More huffman resets makes decode slower, so a good encoder should expose some kind space-speed tradeoff parameter (and/or a maximum number of resets).

cbloom rants 06-15-10 - Top down - Bottom up
cbloom rants 10-02-12 - Small note on Adaptive vs Static Modeling

4. Multi-parse. The optimal parser needs to be seeded in some way, with either initial code costs or some kind of heuristic parse. There may be multiple local minima, so the right way to do it is to run 4 seeds (or so) simultaneously with different strategies.

cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies

5. The only unsolved bit : huffman - parse feedback. The only solution I know to this is iteration. You should use some tricks like smoothing and better handling of the zero-frequency symbols, but it's just heuristics and iteration.


One cool thing to have would be a cheap way to compute incremental huffman cost.

That is, say you have some array of symbols. The symbols have a corresponding histogram and huffman code. The full huffman cost is :

fullcost(symbol set) = cost{ transmit code lengths } + sum[n] { codelen[n] * count[n] }
that is, the cost to send the code lengths + the cost of sending all the symbols with those code lengths.

You'd like to be able to do an incremental update of fullcost. That is, if you add one more symbol to the set, what is the delta of fullcost ?

*if* the huffman code lengths don't change, then the delta is just +codelen[symbol].

But, the addition of the symbol might change the code lengths, which causes fullcost to change in several ways.

I'm not sure if there's some clever fast way to do incremental updates; like when adding the symbol pushes you over the threshold to change the huffman tree, it often only changes some small local part of the tree, so you don't have to re-sum your whole histogram, just the changed part. Then you could slide your partition point across an array and find the optimal point quite quickly.


Now some ranting.

How sad is it that we're still using zip?

I've been thinking about writing my own super-zipper for many years, but I always stop myself because WTF is the point? I don't mean for the world, I guess I see that it is useful for some people, but it does nothing for *me*. Hey I could write some thing and probably no one would use it and I wouldn't get any reward from it and it would just be another depressing waste of some great effort like so many other things in my life.

It's weird to me that the best code in the world tends to be the type of code that's given away for free. The little nuggets of pure genius, the code that really has something special in it - that tends to be the free code. I'm thinking of compressors, hashers, data structures, the real gems. Now, I'm all for free code and sharing knowledge and so on, but it's not equal. We (the producers of those gems) are getting fucked on the deal. Apple and the financial service industry are gouging me in every possible immoral way, and I'm giving away the best work of my life for nothing. It's a sucker move, but it's too late. The only sensible play in a realpolitik sense of your own life optimization is to not work in algorithms.

Obviously anyone who claims that patents provide money to inventors is either a liar (Myhrvold etc.) or just has no familiarity with actual engineering. I often think about LZ77 as a case in point. The people who made money off LZ77 patents were PK and Stac, both of whom contributed *absolutely nothing*. Their variants were completely trivial obvious extensions of the original idea. Of course the real inventors (L&Z, and the modern variant is really due to S&S) didn't patent and got nothing. Same thing with GIF and LZW, etc. etc. perhaps v42 goes in there somewhere too; not a single one of the compression-patent money makers was an innovator. (and this is even igoring the larger anti-patent argument, which is that things like LZ77 would have been obvious to any number of researchers in the field at the time; it's almost always impossible to attribute scientific invention/discovery to an individual)

2/23/2013

02-23-13 - Threading Patterns - Wake Polling

Something I've written about a lot but never given a solid name to.

When a thread is waiting on some condition, your goal should be to only wake it up if that condition is actually true - that is, the thread really gets to run. In reverse order of badness :

1. Wakeup condition polling. This is the worst and is very common. You're essentially just using the thread wakeup to say "hey your condition *might* be set, check it yourself". The suspect code looks something like :


while( ! condition )
{
    Wait(event);
}

these threads can waste a ton of cycles just waking up, checking their condition, then going back to sleep.

One of the common ways to get nasty wake-polling is when you are trying to just wake one thread, but you have to do a broadcast due to the possibility of a missed wakeup (as in the naive semaphore from waitset ).

Of course any usage of cond_var is a wake-poll loop. I really don't like cond_var as an API or a programming pattern. It encourages you to write wakee-side condition checks. Whenever possible, waker-side condition checks are better. (See previous notes on cond vars such as : In general, you should prefer to use the CV to mean "this condition is set" , not "hey wakeup and check your condition").

(ADDENDUM : in fact I dislike cond_var so much I wrote a proposal on an alternative cond_var api ).

Now it's worth breaking this into two sub-categories :

1.A. Wake-polling when it is extremely likely that you get to run immediately.

This is super standard and is not that bad. At root, what's happening here is that under normal conditions, the wakeup means the condition is true and you get to run. The loop is only needed to catch the race where someone stole your wakeup.

For example, the way Linux implements semaphore on futex is a classic wake-poll. The core loop is :


    for(;;)
    {
        if ( try_wait() ) break;

        futex_wait( & sem->value, 0 ); // wait if sem value == 0
    }

If there's no contention, you wake from the wait and get to try_wait (dec the count) and proceed. The only time you have to loop is if someone else raced in and dec'ed the count before you. (see also in that same post a discussion of why you actually *want* that race to happen, for performance reasons).

The reason this is okay is because the futex semaphore only has to do a wake 1 when it signals. If it had to do a broadcast, this would be a bad loop. (and note that the reason it can do a broadcast is due to the special nature of the futex wait, which ensures that the single thread signal actually goes to someone who needs it!) (see : cbloom rants 08-01-11 - Double checked wait ).

1.B. Wake-polling when it is unlikely that you get to run.

This is the really bad one.

As I've noted previously ( cbloom rants 07-26-11 - Implementing Event WFMO ) this is a common way for people to implement WFMO. The crap implementation basically looks like this :


while ( any events in array[] not set )
{
    wait on an unset event in array[]
}

What this does is any time one of the events in the set triggers, it wakes up all the waiters that are waiting on it in an array, checks the array, and they go back to sleep.

Obviously this is terrible, it causes bad "thread thrashing" - tons of wakeups and immediate sleeps just to get one thread to eventually run.

2. "Direct Handoff" - minimal wakes. This is the ideal; you only wake a thread when you absolutely know it gets to run.

When only a single thread is waiting on the condition, this is pretty easy, because there's no issue of "stolen wakeups". With multiple threads waiting, this can be tricky.

The only way to really robustly ensure that you have direct handoff is by making the wakeup ensure the condition.

At the low level, you want threading primitives that don't give you unnecessary wakeups. eg. we don't like the pthreads cond_var that has you call :

    condvar.wait();
    mutex.lock();
as two separate calls, which means you can wake from the condvar and immediately fail to get the mutex and go back to sleep. Prefer a single call :
    condvar.wait_then_lock(mutex);
which only wakes you when you get a cv signal *and* can acquire the mutex.

At the high level, the main thing you should be doing is *waker* side checks.

eg. to do a good WFMO you should be checking for all-events-set on the *waker* side. To do this you must create a proxy event for the set when you enter the wait, register all the events on the proxy, and then you only signal the proxy when they are all set. When one of them is set, it does the checking. That is, the checking is moved to the signaller. The advantage is that the signalling thread is already running.

old rants