2/25/2009

02-25-09 - Low Level Threading - Part 4.1

LIFO MPMC Stack

Last time we learned about publishing with CAS-retry, so let's use it to make our first LockFree data structure. The simplest one is probably the linked list LIFO (aka a stack) so we'll do that.

There is one global pointer to the top of the stack. Nodes have a next pointer. Something like this :

struct Node
{
    Atomic< Node * > m_next;
};

Atomic< Node * > g_stack = NULL;

This is an MPMC structure (Multi-Producer Multi-Consumer). Basically that means anybody in the world is allowed to be jamming on g_stack at any time and it's all good. Remember that Atomic< > in my writeups here just means that we ensure the data type can be addressed atomically; it doesn't imply anything about memory ordering, unlike MSVC 2005 volatile or anything like that.

Note that "Node" has no data in it. That's intentional - this is meant to be used intrusively. That is, you put Node at the top of your own data structure inline, and then you pass in your own structures. I always try to implement things intrusively at the low level because you can always build non-intrusive on top of intrusive, but not vice-versa. Intrusive is more fundamental and can be much more efficient.
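For example (the "Job" struct and its field names here are just made up for illustration), an intrusive user of this stack might look like :

```cpp
#include <atomic>

// The generic intrusive link node; note it carries no payload.
struct Node
{
    std::atomic<Node *> m_next;
};

// Your own structure embeds the Node inline at the front, so a
// pointer to the Node and a pointer to the Job are interchangeable.
struct Job
{
    Node m_link;  // must be the first member for the simple cast below
    int  m_workId;
};

// Recover the enclosing Job from the Node the stack hands back :
inline Job * JobFromNode(Node * n)
{
    return reinterpret_cast<Job *>(n);
}
```

A non-intrusive stack of any type T is then trivially built on top by allocating { Node, T } pairs, which is the "build non-intrusive on intrusive" direction.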

Other than the Atomic this looks just like a normal stack you would implement. So what would you normally write to push a new node ?


void Push(Node * node)
{
    node->m_next = g_stack;
    // (*)
    g_stack = node;
}

That's pretty simple and it almost works lock free, but there's one problem that should be obvious. "g_stack" is a shared variable that anybody else can be touching, and at (*) it can change so that we are doing something inconsistent - stomping on someone else's change.

Obviously any time we might be stomping on someone else's change, a CAS-retry is the natural suggestion, so let's fix it :


LF_Result Push_NoLoop(Node * node)
{
    // load g_stack to a local var, no memory order needed :
    Node * localStack = LoadRelaxed(&g_stack);

    // "node" is wholly owned by me, no one else can see it, so just a normal store :
    node->m_next = localStack;

    // try to change g_stack from "localStack" to "node"
    //    if it fails it means there was contention :

    return CAS_Release( &g_stack, localStack, node );

    // CAS has release order so that when g_stack is set to node, 
    //  previous writes inside node are published correctly
    // CAS returns {Success,Failure,Abort}
}

This should look very much like the above Push, we've just been a bit more careful about things and now it works lock free. Of course Push_NoLoop doesn't actually always do the push - to do that we need to loop :

void Push(Node * node)
{
    while( Push_NoLoop(node) != LF_Success )
    {
    }
}
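For reference, here is a sketch of the same Push written against C++0x/C++11 std::atomic (this uses compare_exchange_weak in place of my CAS_Release wrapper, and a plain Node* next since the node is private until published; the names are mine) :

```cpp
#include <atomic>

struct Node { Node * m_next; };

std::atomic<Node *> g_stack { nullptr };

// compare_exchange_weak reloads localStack with the current value
// whenever the CAS fails, so the retry loop is very compact :
void Push(Node * node)
{
    Node * localStack = g_stack.load(std::memory_order_relaxed);
    do
    {
        node->m_next = localStack; // node is still wholly owned by us
    } while ( ! g_stack.compare_exchange_weak(localStack, node,
                    std::memory_order_release,    // success : publish
                    std::memory_order_relaxed) ); // failure : just retry
}
```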

Note that in these retry spins there's not much point in doing sleep-backoff type stuff (maybe there's a tiny bit of a point for super heavy cache contention), but I go ahead and use my SpinBackoff helper for these anyway just so that it can assert if I accidentally write a bug and code an infinite loop.

(also note that really the Push loop should be using CAS to reload localStack but I find it more educational to write it this way for now).

Let's go ahead and write the Pop the same way. Single threaded Pop is :


Node * Pop()
{
    Node * node = g_stack;
    if ( ! node ) return NULL;
    g_stack = node->m_next;
    return node;
}

There are obviously the same issues in here as we had with Push, so we try to fix it up :

LF_Result Pop_NoLoop(Node ** pNode)
{
    Node * localStack = LoadAcquire(&g_stack);  // (*1)
    // Acquire memory order here because we will read stuff dependent on localStack

    *pNode = localStack; // we hope to return this

    if ( localStack == NULL ) return LF_Success; // empty stack , *pNode is NULL
    
    Node * nextStack = localStack->m_next; // relaxed load is fine here //(*2)

    // try to step g_stack to localStack->next if no one else has changed it
    return CAS_Release(&g_stack,localStack,nextStack);

    // if we don't return Success then don't touch *pNode !!
}

and again to make a Pop that always gets a Node you just spin on this. This is similar to what we did before and it mostly works, but there's a really big dirty problem here which will plague all our lock free work.
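In the same C++11 sketch style, a spinning Pop looks like this (again compare_exchange_weak reloads localStack on failure; as the rest of this section explains, this is only safe if popped nodes are never actually returned to the OS) :

```cpp
#include <atomic>

struct Node { Node * m_next; };

std::atomic<Node *> g_stack { nullptr };

Node * Pop()
{
    // acquire because we will read through localStack :
    Node * localStack = g_stack.load(std::memory_order_acquire);
    while ( localStack != nullptr )
    {
        Node * nextStack = localStack->m_next; // the dangerous read (*2)
        if ( g_stack.compare_exchange_weak(localStack, nextStack,
                std::memory_order_acq_rel,   // success
                std::memory_order_acquire) ) // failure : localStack reloaded
            break;
    }
    return localStack; // NULL if the stack was empty
}
```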

The issue is what do the consumers typically do with nodes after they pop them? They look inside at the data and poke around with it - and then they free them! (or recycle them or something, they don't just leave them alone forever). So a typical consumer might look like :


Node * node = Pop();
if ( node )
{
    int got = node->data;
    free(node);
}

But the node that we just popped used to be the top of the stack. That means it was in g_stack, and many other threads may have grabbed a pointer to it at (*1). We happened to be the first one to return successfully so we got the node back and freed it, but at (*1) they still have localStack pointing at our node.

Now they walk along, la di da, and at (*2) they try to look inside localStack. BOOM access to freed memory.

Note that this occurred because Pop() is allowed to be multi-consumer. If we restrict our stack and say only one thread is ever allowed to Pop (single consumer), then this problem does not occur. That is, by going to single-consumer we have made all the Pops run single-threaded against each other, so they only need to worry about multi-threading against Push, which is safe.

This is a common and important pattern : Multi-Producer is easy. Multi-Consumer is hard. The general reason is that the Producer doesn't have any branches or invalid states. It just makes a new node and sticks it on. Note that in the single-threaded version of the stack, Push always just unconditionally pushes a node. The Pop on the other hand has to deal with the empty case. It is shrinking the size of the data structure, heading towards the empty state, and it is discarding nodes. We'll see more about this later.

How do we handle this possible access to freed memory? First of all, notice that it is benign. That is, if reading from freed memory just gave us back a random value, that would be okay, because we will always fail the CAS and the bad value that we read will be discarded. So, there are three fixes :

1. Garbage Collection / Safe Memory Reclamation / etc. : allocate Nodes with a special allocator that prevents them from actually being freed while there's any possibility of a thread touching it. SMR is an efficient way to do this in C++ that's quite elegant. SMR can keep around as many extra nodes as there are consumer threads. BTW the simplest form of GC actually works fine in most applications : just keep the nodes around forever. Allocate them from a Node Pool that only ever grows and recycles them, it never actually hands them back to the OS. You probably want some kind of Node Pool / GC allocator if you're passing around a bunch of Node shitlets.

2. Just free it, but don't return it to the OS. If you're using your own malloc you can literally just Free() the node and put it back in the malloc pool and let somebody else grab it and write in it. That's totally fine - you might read random junk out of it, but that doesn't hurt here. You just don't want it to get all the way back to the OS and get marked so you page fault.

3. Catch the page fault.

This last one is rather cute, and is in fact what Windows SList does. If we get a page fault at (*2) reading localStack->next it means somebody popped off head and freed it while we were working. That just means there was contention and we should retry, so we can change Pop easily :


LF_Result Pop_NoLoop(Node ** pNode)
{
    Node * localStack = LoadAcquire(&g_stack);
    *pNode = localStack; // we hope to return this

    if ( localStack == NULL ) return LF_Success;

    Node * nextStack;
    __try
    {
        nextStack = localStack->m_next; // possible access violation
    }
    __except( GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION ?
              EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH )
    {
        return LF_Abort; // page fault - contention - retry
    }

    return CAS_Release(&g_stack,localStack,nextStack);
}

that works great and is plenty fast on Windows, but sadly having a fast hard exception catcher is not portable, so we can't rely on this method. It's also not really that important because we are writing code that we control, so solutions #1 or #2 are totally fine for us. You can see that this technique is very cool for Windows SList because they cannot control how you do your mallocs and frees, so they must support any memory allocation pattern, which includes the memory being totally freed.
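A minimal sketch of the Node Pool idea from fixes #1/#2 (not thread safe by itself - in real use you would guard the freelist with a lock or make the freelist itself a lock-free stack; all the names here are mine) :

```cpp
#include <cstdlib>

struct Node { Node * m_next; };

// Grow-only pool : freed nodes go on a freelist and are recycled,
// but the memory is never returned to the OS, so a stale reader can
// only ever see junk values - never a page fault.
class NodePool
{
    Node * m_free = nullptr;
public:
    Node * Alloc()
    {
        if ( m_free )
        {
            Node * n = m_free;
            m_free = n->m_next;
            return n;
        }
        return (Node *) std::malloc(sizeof(Node)); // pool grows, never shrinks
    }
    void Free(Node * n)
    {
        n->m_next = m_free; // node stays valid memory forever
        m_free = n;
    }
};
```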

BTW this whole issue is another reason why GC languages like Java or C# are much easier to write lock-free code in.

Okay, we have our Stack juju going, but there's still a huge problem in here which we will address in the next post.

02-25-09 - Cast with union

Sometimes I need to cast one object to another as a "bit cast", not a "value cast" (I don't know the good terms for this). The classic way is :

template <typename t_to, typename t_fm>
t_to & same_size_bit_cast( t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    return *( (t_to *) &from );    
}

Which works fine except that it runs afoul of the strict aliasing rules, so GCC's optimizer may miscompile it. (maybe I can just mark that up right, eg. with may_alias, and not have trouble?).

The union cast :


template <typename t_to, typename t_fm>
t_to same_size_bit_cast( t_fm from )
{
    RR_COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    union _bit_cast_union
    {
        t_fm fm;
        t_to to;
    };
    _bit_cast_union converter = { from };
    return converter.to;
}

*almost* works. It works sometimes, like on basic types it works no problem. If t_to or t_fm has constructors or assignment operators though, it refuses to compile, which is damn annoying (it would be handy in C++ if there was a way to say "I want an instance of this object here but don't run any constructors or destructors, just give me the member layout without the associated functions"). Often the whole reason I'm doing this is because I want to convert a more complex object into something I can just treat as a bag of bits. (for example if you want to use the windows Interlocked64 ops to work an LFSList type, you need to cast from a struct to a LONGLONG).

What's worse is that if t_fm or t_to has certain type qualifiers that get attached in the template argument deduction, it fails, and it becomes a big mess of casting back on the user side to figure it out and get that stuff off. (qualifiers like const or volatile or will make the union illegal).

Furthermore there's a nasty difference in that the latter gives you a *copy* while the first gives you a reinterpret pointer to the same memory that you can fiddle with, which is what I'd rather have, but I guess that's verboten these days.
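For completeness : a third way that sidesteps both the aliasing complaint and the constructor problem is to memcpy through a local; compilers turn the fixed-size memcpy into a plain register move. A sketch (it returns a copy, like the union version; t_to must be default-constructible) :

```cpp
#include <cstring>
#include <cstdint>

// memcpy-based bit cast : legal under the strict aliasing rules and
// works even when the types have constructors or assignment operators.
template <typename t_to, typename t_fm>
t_to memcpy_bit_cast( const t_fm & from )
{
    static_assert( sizeof(t_to) == sizeof(t_fm), "sizes must match" );
    t_to to;
    std::memcpy( &to, &from, sizeof(to) );
    return to;
}
```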

2/24/2009

02-24-09 - Low Level Threading - Part 3.1

Atomic RMW and Hardware

The next building block we need is some kind of atomic RMW. We've talked about getting atomic loads & stores and memory ordering, but we need a way to change a value and know that we are changing the original value.

The classic example is two threads trying to increment a shared counter :

Atomic< int > g_counter = 0;

// thread1 :
g_counter ++;

// thread2 :
g_counter ++;
After this runs, g_counter might be 1 or 2 !! The problem is what ++ is actually doing :

g_counter ++ :

temp = load g_counter
temp ++;
store g_counter , temp

Now you should be able to see that both threads can load g_counter at 0, both can increment to 1 in their temp vars, then both store a 1. To fix this we could take a lock :

lock(g_counter)
temp = load g_counter
temp ++;
store g_counter , temp
unlock(g_counter)

And that is exactly the Atomic read-modify-write (RMW) that we need. If we use this for g_counter++ then it will always be = 2 after our program runs.
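In C++11 std::atomic terms this RMW increment is fetch_add; a quick sketch of the two-thread counter done correctly :

```cpp
#include <atomic>
#include <thread>

std::atomic<int> g_counter { 0 };

// fetch_add is an atomic RMW : the load-increment-store is indivisible,
// so two concurrent increments can never collapse into one.
void IncrementFromTwoThreads()
{
    std::thread t1( []{ g_counter.fetch_add(1); } );
    std::thread t2( []{ g_counter.fetch_add(1); } );
    t1.join();
    t2.join();
}
```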

Fortunately almost all hardware provides some kind of cheap atomic micro-lock mechanism to do this. On x86 you have the "lock prefix" instructions (the Interlocked intrinsics directly correspond to lock instructions for the most part). On Power you have a load-linked/store-conditional with something like lwarx-stwcx. On Power you are almost literally writing a micro-lock and unlock on the cache line. On x86 it's more implicit that a lock is taken for a period of time, it pretends to be a simple instruction. A bit more on this later.

The most powerful and useful atomic RMW is probably the CAS (Compare and Swap). I'm going to write CAS using the C++0x style since that is becoming more and more standard around the web and in the literature :


// T must be Atomic
bool CAS(T * pVar, T & oldVal, T newVal)
{
    lock(pVar);  // "micro-lock" the variable
    
    bool match = (*pVar == oldVal);
    oldVal = *pVar;
    if ( match )
        *pVar = newVal;    

    unlock(pVar);

    return match;
}

CAS checks if *pVar is = oldVal and if it is, then it stuffs in newVal. Note that oldVal is a *reference* and it also reads the current value of pVar into there. (I really despise that style of coding but it's the way the C++0x standard CAS is defined and you will see people rely on that behavior, so be aware).

Actually C++0x defines CAS a little differently; it can have spurious failures. Basically that means that the micro-lock on the variable can fail and CAS can return false even though *pVar was == oldVal :


bool CAS(T * pVar, T & oldVal, T newVal)
{
    if ( ! trylock(pVar) )
        return false;
    
    bool match = (*pVar == oldVal);
    oldVal = *pVar;
    if ( match )
        *pVar = newVal;    

    unlock(pVar);

    return match;
}

On x86 that doesn't really make any sense, but on other architectures where the RMW is implemented with a load-linked/store-conditional retry loop, it gives them the flexibility to fail the micro-lock (eg. if the cache line reservation is lost).

Note that the arguments are in the opposite order of InterlockedCompareExchange which can be a bit confusing.

CAS is super useful and can be used to implement almost anything. For example, CAS could be use to implement increment :

void AtomicIncrement( Atomic * pVar )
{
    Atomic oldVal, newVal;
    do
    {
        oldVal = Load( pVar );
        newVal = oldVal + 1;
    } while ( ! CAS(pVar,oldVal,newVal) );
}

Note that I failed to take advantage of the fact that CAS is reading oldVal for me in the second operand; really this should be :

void AtomicIncrement( Atomic * pVar )
{
    Atomic oldVal = Load( pVar );
    Atomic newVal;
    do
    {
        newVal = oldVal + 1;
    } while ( ! CAS(pVar,oldVal,newVal) );
}

When CAS fails, oldVal is reloaded with the current value, we add 1 to that and try again. Obviously this is a crazy way to implement increment - the hardware probably has a direct atomic increment - but it shows the power of CAS, and also why C++0x made it reload the old value in the second arg : it makes loops like this very handy.
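The C++11 equivalent of the second version is compare_exchange_weak, which has exactly the reload-on-failure behavior described above :

```cpp
#include <atomic>

// CAS-loop increment : on failure, compare_exchange_weak writes the
// freshly observed value back into oldVal, so we just loop and retry.
void AtomicIncrement( std::atomic<int> * pVar )
{
    int oldVal = pVar->load(std::memory_order_relaxed);
    while ( ! pVar->compare_exchange_weak(oldVal, oldVal + 1) )
    {
        // oldVal was reloaded for us; nothing else to do
    }
}
```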

Note that I haven't been talking about the memory order semantics here. The reason is that the client algorithm needs to specify the memory order semantics that are needed for that operation. CAS itself does not have a memory order requirement; you might want to CAS_Acquire or CAS_Release or CAS_Acq_Rel or CAS_Sequential. They all have their uses depending on what the actual operation needs, and they enforce ordering of memory ops outside the CAS against the CAS (which is treated as one op).

The other really basic operation that we will use a lot is "Exchange". Exchange is almost always a lot faster than CAS. It just reads the old value and stuffs the new value atomically.

We are now ready to implement our own simple mutex :


// 0 is unlocked, 1 is locked
typedef Atomic< int > SimpleMutex; 

void SimpleLock(SimpleMutex * mut)
{    
    SpinBackOff backoff;
    
    while( ! CAS_Sequential(mut,0,1) )
    {
        // mut was not 0 , spin/yield/backoff try again
        backoff.BackOffYield();
    }
    // mut should now be 1    
}
    
void SimpleUnlock(SimpleMutex * mut)
{
    // *mut should now be 1
    StoreRelease(mut,0);
}

(there are a lot of possible memory semantics you might want for your mutex, this is only one possibility; there are subtle reasons that you want mutex locks to be globally sequential and not just acquire which I won't go into)
(obviously this isn't reentrant or any of that stuff, and I'm not suggesting you use this instead of critsec - critsec is better in almost every way, this is just an example)
(backoff is a little helper struct I use that spins a while then sleeps with increasing intervals).
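Here is the same SimpleMutex sketched with std::atomic (no backoff helper; just a raw spin, so don't use this for anything real) :

```cpp
#include <atomic>

// 0 is unlocked, 1 is locked - same scheme as SimpleMutex above.
std::atomic<int> g_mut { 0 };

void SimpleLock()
{
    int expected = 0;
    // a seq_cst CAS plays the role of CAS_Sequential :
    while ( ! g_mut.compare_exchange_weak(expected, 1,
                    std::memory_order_seq_cst) )
    {
        expected = 0; // CAS wrote the observed value here; reset and retry
    }
}

void SimpleUnlock()
{
    // release store, like StoreRelease(mut,0) :
    g_mut.store(0, std::memory_order_release);
}
```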

Sequential memory semantics means that memory operations cannot move past it in either direction, and ALSO that there is a global total ordering. That is, if we have three threads doing this :

// thread 1
Store(g_global,1)

// thread 2
Store(g_global,2)

// thread 3
Store(g_global,3)

And each thread runs on its own processor. Then processor 1 might see g_global be 2 then 3 then 1. Processor 2 might see g_global be 3 then 2 then 1. Obviously that makes no sense in a total order, but the memory ops are allowed to be seen by the processors in different orders as long as they satisfy the constraints placed on them.

If we make all those stores into StoreSequential, then all processors will see the ops change in the same way. They might see them at different times, but they see them in the same order. Think about it like a room full of people. Everybody needs to tell everyone else their name. If you just let everyone go and tell their name to other people, then each individual will hear all the others' names in very different order. eg. Mike might hear Alex then Andy then Dave then Chris. Dave might hear Alex then Mike then Chris then Andy. The total order basically means first one person raises their hand, then they say their name to everyone, then the next person raises their hand. Thus everybody hears each name in the same order.

Obviously Sequential ops are a lot more expensive because they require all the processors in the system to agree on a sync point that they will not reorder other sequential operations across.

Another common memory ordering we haven't really talked about is "Acq_Rel" or Acquire+Release. This means that loads can't move up before it, and stores can't move down past it, but loads could move down past it and stores could move up before it. Acq_Rel also doesn't imply a total order. You often want to use Acq_Rel when you are creating a sync point that does both a read and a publish, but doesn't need full Sequential order. For example :


Load(a); // can move down
Store(b); // cannot move down

CAS_Acq_Rel( stuff );

Load(c); // cannot move up
Store(d); // can move up


// so it could change to this :
// (assuming a,b,c,d don't overlap)

Store(b);
Store(d);

CAS_Acq_Rel( stuff );

Load(a);
Load(c);

Acq_Rel vs. Sequential is sort of like lwsync vs hwsync on Power (but not exactly I don't think). Acq_Rel is useful when I am trying to use CAS to both publish something and to get some values.

Okay, I think that's the majority of the tools we need. I'm going to talk a bit about what actually happens on x86, but you can just ignore this part if you want.

We already talked last time about how x86 is already automatically acquire & release, that is, there are #LoadLoad and #StoreStore barriers everywhere - but acquire and release say nothing about how loads move against stores (that would be #LoadStore and #StoreLoad barriers - the notation is that #XY means X must occur before Y). Well it turns out x86 also automatically has #LoadStore ordering. The x86 multiprocessor model was designed to make single-proc code "just work" and they did a fuck of a good job of it. The crucial thing is that Stores coming out of any core come out in the order of the program code. Obviously the #StoreStore barriers everywhere enforce that. What it all means is that Stores must be done at the very end of the x86 execution pipe and retired in order. Thus there's no way for a load to move after a store - in fact we've seen that loads can get executed speculatively, but there's no way to execute a store speculatively because that would invalidate the order coming out of the chip.

The remaining barrier that we don't have is #StoreLoad , to keep loads after stores. Note that #StoreLoad is automatic for access to the same variable (and doesn't cause a Load-Hit-Store stall on x86 because we have the awesome thing where the load can peek into the write buffer for the new value), however it is not automatic for *different* variables. eg.

g_var1 = a;
...
b = g_var2;

The read of var2 can move before the write to var1. As I've said before, this is *good* - you want your processor to be able to reorder things. You just need to be able to tell it when not to. On x86 the only way to make a #StoreLoad barrier is with an mfence or any interlocked op.
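In portable C++11 terms the full #StoreLoad barrier is a seq_cst fence (on x86 it compiles to an mfence or a locked op); a sketch of the pattern above with the barrier in place :

```cpp
#include <atomic>

std::atomic<int> g_var1 { 0 };
std::atomic<int> g_var2 { 0 };

int StoreThenLoad( int a )
{
    g_var1.store(a, std::memory_order_relaxed);
    // full fence : the load of g_var2 below cannot move
    // above the store of g_var1 above
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return g_var2.load(std::memory_order_relaxed);
}
```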

Interlocked ops on x86 are always sequentially consistent. That makes coding easier, but means they are very expensive. (Itanium has some Acquire & Release interlocked ops ; presumably future Intel stuff will have some weaker memory model instructions). Basically to execute a "lock" instruction the x86 CPU does the pseudo code that I wrote above for CAS. First it goes out to the bus and reserves a cache line (x86 interlocks actually hold whole cache lines which makes any op inside a cache line atomic) then it runs the cmpxchg instruction or whatever you wanted, then it releases the cache line.

When it reserves the cache line, all other processors who try to acquire that line are blocked and might stall. When it releases the cache line, all processors that had work based on that line in flight must dump it. That is, any speculative reads and all computations based on them must be invalidated from the deep execution pipe of modern x86 chips.

On modern multi-core chips with shared cache this is bad, but it's not awful. On SMP machines where the cache line reservation has to go out to bus, it's very bad indeed. And what's really awful is that a CAS on one chip doesn't just slow down that chip - it syncs the bus, and it *really* slows things down if anybody else is touching that cache line.

Even though Interlocked CAS is "lock free" it is by no means cheap, and the people who write really efficient lock free stuff try to do so even without CAS.

Note that the crucial issue here was *contention* over a *cache line*. That is a common theme and is the main thing that you should actually worry about. If you have some shared data item which is heavily contended (lots of threads need to hit it often), then changing it from a CritSec to an Interlocked CAS will help some, but the CPUs are still going to be stalling on each other like crazy. You need to reduce the contention over a single data items. You also need to worry about so-called "false sharing". Any data items on the same cache line count for contention, so even though you were good boy and made your threads work on separate variables, if they are on the same cache line you still pay much of the price because they will sit and fight over that cache line over and over.
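A concrete defense against false sharing is to give each hot variable its own cache line with alignment padding; a sketch assuming 64-byte lines :

```cpp
#include <atomic>

// alignas(64) puts each counter on its own cache line (and makes
// sizeof a multiple of 64), so an array of these never false-shares.
struct alignas(64) PaddedCounter
{
    std::atomic<long> m_count { 0 };
};

PaddedCounter g_perThreadCount[4]; // one line per thread, no fighting
```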

Okay, that's enough for basics. Next time we'll look at some techniques for lock free algorithms.

02-24-09 - Low Level Threading - Part 3.0

Intro to Lock Free Programming

Disclaimer : don't do it. Okay now let's dive in.

"Lock Free" is what this general group of techniques is usually called, but it's a bit of a misnomer. "Lock Free" just means not using any blocking thread synchronization techniques like mutexes; however, the classic lock free technique of try-discard-loop is basically a spin lock. Hence there is another group of techniques called "Wait Free" to indicate when not even that kind of wait is necessary. Another class recently coined is "Obstruction Free". And in fact we don't care if we sometimes take mutexes to resolve sticky spots (a lot of people now are wisely doing this - for example queues where you take a mutex when near empty).

A lot of people take "Lock Free" to mean avoiding the OS mutex because it's too slow. That's not really a great goal and you won't see a huge speedup if you do that. Good Lock Free programming is about redesigning your code flow to minimize contention for shared resources and to spend a minimum amount of time in contention. Obviously one form of contention is fighting over an OS mutex, but fighting over a cache line is contention too and hurts badly.

Our goal is just to do fast & correct thread synchronization. We don't want to cut any corners and introduce possible races. Any tiny speed win from cheating is just not worth it.

In lock free programming we are basically trying to share some data between threads in a way that we can understand sequentially so that we can code it right. In this post I'm going to talk about some simple "lock free" coding patterns and what's going on. In the next post we'll go through some of the simple lock free data structures.

First off let's look at the simple publication pattern and what's really going on. Thread 1 sets up some work and then "publishes" it. Thread 2 waits for the work then does it.


// shared variables :
void * shared_var1;
void * shared_var2;
bool shared_published = false;

//Thread 1:

.. set up x & y ..
shared_var1 = x;
shared_var2 = y;
shared_published = true;

// Thread 2:

while( ! shared_published ) { }
x = shared_var1;
y = shared_var2;
DoWork(x,y);

Already your spidey senses should be tingling. Let's go through and look at the problems :


// shared variables :
// !! should pad a cache line here to keep away from other pages
// !! need to align to required atomic alignment
void * shared_var1;
void * shared_var2;
bool shared_published = false; // !! <- bool is not safe because it might be a char which is not atomic on many systems
// !! should pad a cache line here to keep away from other pages

//Thread 1:

// !! need to ensure all these assignments are atomic
.. set up x & y ..
shared_var1 = x;
shared_var2 = y;
// !! make sure the compiler doesn't reorder writing shared_var1 & 2 versus published !
shared_published = true; // !! (*1) var1 and var2 must be flushed before published flag is set
// !! can never touch var1 or var2 again

// Thread 2:

while( ! shared_published ) { } // !! read must be acquire (*2)
x = shared_var1;
y = shared_var2;
DoWork(x,y);

Okay, so this simple code already has lots of issues. The exact details of those issues depends on your system, and we don't want to make the client code full of lots of details about the exact system. We want to write client code that is *algorithms* and then have adaptors so that our code is portable and efficient on various systems.

The first little issue mentioned is that the variable loads and stores should be atomic, which just means done all in one go. On the vast majority of systems, 4 and 8 byte types that are aligned to 4 or 8 bytes are atomic. Smaller types are often NOT atomic because the actual load is done by loading a word then masking down to a byte, so beware of bool! Similarly larger types are often not atomic, and 8-byte types that are not 8-byte aligned are not atomic! (on x86, anything that is wholly contained in a cache line is atomic - x86 always reads & writes whole cache lines as one atomic op - more details on hardware later). Anyway, we don't want our client code to know about this. Clearly what we want is some kind of Atomic< > template that does all this for us.

Atomic< void * > shared_var1;
Atomic< void * > shared_var2;
Atomic< int > shared_published = false;

This Atomic< > just ensures the type is a valid size to be atomic on this platform, and is aligned to the necessary alignment for this platform, it also overrides operator = to make sure the assignment is a simple atomic op.
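A sketch of the compile-time checks such an Atomic< > might do (this is just the size/alignment shell, not the full wrapper with the overloaded assignment; today std::atomic<T> does all of this for you) :

```cpp
#include <cstdint>

// Enforce an atomically-addressable size and natural alignment at
// compile time. A real version would also wrap the loads and stores.
template <typename T>
struct Atomic
{
    static_assert( sizeof(T) == 4 || sizeof(T) == 8,
                   "type is not atomically addressable on this platform" );
    alignas(sizeof(T)) T m_value;
};
```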

Next we have this to deal with :


// !! make sure the compiler doesn't reorder writing shared_var1 & 2 versus published !

Remember the compiler optimizer will go nuts reordering things. Instructions don't necessarily actually happen in the order that you write code. The key thing is that *we want that*. The optimizer reordering things is what makes our code fast. We don't want to just turn off the optimizer. For example, we could make all our variables "volatile". I'm not going to talk much about the new magic VC2005+ volatile, but in the old meaning of volatile, all volatile memory ops are executed in program order. So if we make everything volatile we would enforce

shared_var1 <- x
shared_var2 <- y
shared_published <- true

But really we would be happy to get

shared_var2 <- y
shared_var1 <- x
shared_published <- true

Or

shared_var2 <- y
shared_var1 <- x
anything_else <- z
shared_published <- true

We just care about var1 and var2 being done before "published". To make fast code we want the constraints to be as *precise* and as *weak* as possible while still being correct.

The key point is at (*1) :


shared_published = true; // !! (*1) var1 and var2 must be flushed before published flag is set

What we are doing here is "Release" memory semantics. When you do a Store-Release it is a "publishing" store - it ensures that no earlier writes can pass the Release. Obviously this Store-Release has to encapsulate constraints on various parts of the system : it must tell the optimizer not to reorder writes, it must tell the CPU that those writes must be done in order, and it must tell the cache controllers that those affected memory locations must be flushed in order.

Note : you could accomplish this same thing with a barrier :

shared_var1 <- x
shared_var2 <- y

# Write Flush Memory Barrier

shared_published <- true;
but you don't want to ! That's way way more expensive. BTW another way to write the "Release" is in the old style with a reorder-blocking barrier between the stores :

shared_var1 <- x
shared_var2 <- y
#StoreStore
shared_published <- true

Here #StoreStore means that stores above the # cannot move past stores below the #. In terms of how you specify the program memory constraints, you can either do it with these reorder blocks between the instructions, or with the modifiers *on* the loads and stores. C++0x has chosen to do the latter, so I'll go with that (and it maps more closely to certain hardware). Java uses the #StoreStore kind of method, as does the Linux Kernel with its smp_ barrier macros.

From now on I am going to write all the Lock Free code using Atomic< > types and special stores and loads. So our basic publication now looks like :


// shared variables :
char pad1[CACHE_LINE];
Atomic< void * > shared_var1;
Atomic< void * > shared_var2;
Atomic< int > shared_published = false;
char pad2[CACHE_LINE];

//Thread 1:

.. set up x & y ..
StoreRelaxed( &shared_var1 , x );
StoreRelaxed( &shared_var2 , y );

StoreRelease( &shared_published , true );
// now don't touch var1 and var2

// Thread 2:

while( ! LoadAcquire(&shared_published) ) { } // (*2)
x = LoadRelaxed(&shared_var1);
y = LoadRelaxed(&shared_var2);
DoWork(x,y);
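For reference, here is that publication pattern written directly against C++11 std::atomic, which maps StoreRelaxed/StoreRelease/LoadAcquire straight onto memory_order_relaxed/release/acquire (Publish and TryConsume are my names; the cache-line padding is omitted) :

```cpp
#include <atomic>

std::atomic<void *> shared_var1 { nullptr };
std::atomic<void *> shared_var2 { nullptr };
std::atomic<int>    shared_published { 0 };

void Publish( void * x, void * y )
{
    shared_var1.store(x, std::memory_order_relaxed);
    shared_var2.store(y, std::memory_order_relaxed);
    // release : the two stores above cannot move below this
    shared_published.store(1, std::memory_order_release);
}

bool TryConsume( void ** px, void ** py )
{
    // acquire : the two loads below cannot move above this
    if ( ! shared_published.load(std::memory_order_acquire) )
        return false;
    *px = shared_var1.load(std::memory_order_relaxed);
    *py = shared_var2.load(std::memory_order_relaxed);
    return true;
}
```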

We haven't talked about the Acquire at (*2), but it's basically just the complement to a StoreRelease. A Load-Acquire ensures that no later loads can move up before the Acquire. In terms of the Java-style barriers these translate to :


x = LoadAcquire(ptr) :
    x = *ptr; // atomically
    #LoadLoad

StoreRelease(ptr,x) :
    #StoreStore
    *ptr = x; // atomically

Note the way they put the barrier on opposite sides of the memory op. (Also recall this is NOT an actual memory "barrier" instruction - it is a blockage for the memory ops, which may or may not actually requires a memory barrier on your platform).

Now, obviously Acquire and Release are almost always what you want :


localVar = LoadAcquire( &shared_var );
// acquire prevents any loads inside this block from moving up

x = localVar;
y = z;
/// etc. ...

// Release prevents any stores in here from moving down
StoreRelease(&shared_var,localVar);

So for example, entering a Mutex should obviously be an Acquire, and releasing the Mutex should be Release - that way work you do inside a mutex is done while you own the mutex (or it looks like it was done while you own the mutex). Everyone who uses critsecs is secretly using Acquire/Release to wrap their critsec work.

I should note that there is another type of load barrier called "consume" (smp_read_barrier_depends() in Linux); this is only necessary on processors that don't automatically order dependent reads. The only known processor that doesn't do that is the Alpha, so I'm not going to deal with that issue. If you care about full correctness you should also go through and mark up the barriers for dependent reads. It's really a strange thing, it means :


object = g_shared; (*1)
# consume memory barrier
int x = object->m_var; (*2)

On Alpha, "object->m_var" can be read "before" g_shared is read. WTF !? Obviously it can't actually be read before object is read, because you need the address of object from (*1) before you can even execute (*2)- but it can read older memory at (*2) than it did at (*1). That is, on the Alpha cache lines are not temporally ordered. So say some other processor has changed the address of object and also the data inside object. You read at (*1) and for some reason you get a fresh new cache line there and get the new value of object. When you read at (*2) you use the new value of object of course, but the cache line it points to is old and it hasn't refreshed, so you read the old value. Thus (*2) acts as if it happened earlier in time than (*1).

Like I said, all existing significant chips do not allow this, so we're not going to worry about it, but you can see how all kinds of different parts of the hardware can cause operations to act like they are going out of order. The goal of our Acquire/Release type barriers is to make instructions act as if they are going in order whether they actually do or not. We are trying to specify the memory constraints semantically in our code, and we want the code generator to make the right stuff for the architecture. In C# or Java this is easy to do because they have memory model semantics; in C++ it's messy, you have to write your own. C++0x will fix this but sadly that's a ways out for most of us.

Now I'll give you a bit of a punchline. On x86, LoadAcquire and StoreRelease are both just "mov". That is, they are 100% free. All (atomic) loads on x86 are acquire and all (atomic) stores are release (ignoring SSE). It is still important to mark up your code semantics with acquire and release for portability, and also so that the compiler actually generates a single mov and keeps your other operations inside. For example on MSVC x86 you can implement them like this :

 
x = LoadAcquire(ptr) :
    mov x , *ptr;
    ReadBarrier(); // (*1)

StoreRelease(ptr,x) :
    WriteBarrier(); // (*2)
    mov *ptr , x;

Note that the "Barriers" at *1 and *2 are not actual memory barriers, they generate ZERO instructions - those are compiler barriers that stop the compiler from moving stuff around. That is an implementation detail that your client code shouldn't have to know about.

This should clarify the win from using memory model semantics. We asked for what we needed - Acquire and Release - and we got them *for free*. Consider instead if we had not asked for them and had used a true memory fence instead - we would be adding 100+ clocks of expense for no reason.
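One portable way to get the "zero instructions, compiler barrier only" behavior without MSVC-specific intrinsics is std::atomic_signal_fence, which constrains compiler code motion but emits no fence instruction. A sketch (mine; note this mapping is only sufficient on x86-like hardware where plain aligned loads and stores are already acquire and release):

```cpp
#include <atomic>

// On x86 a plain aligned mov already has acquire/release semantics in hardware;
// all we need is to stop the COMPILER from reordering around it.
// std::atomic_signal_fence generates zero instructions - it is purely a
// compiler barrier, like MSVC's _ReadBarrier/_WriteBarrier.

template <typename T>
T LoadAcquireX86(const T * ptr)
{
    T x = *(const volatile T *)ptr;                        // plain mov
    std::atomic_signal_fence(std::memory_order_acquire);   // compiler barrier only
    return x;
}

template <typename T>
void StoreReleaseX86(T * ptr, T x)
{
    std::atomic_signal_fence(std::memory_order_release);   // compiler barrier only
    *(volatile T *)ptr = x;                                // plain mov
}

int run_barrier_demo()
{
    int v = 0;
    StoreReleaseX86(&v, 17);
    return LoadAcquireX86(&v);
}
```

On a weakly-ordered architecture this sketch would be wrong - there you really do need std::atomic with memory_order_acquire/release, which is the whole argument for writing the semantics and letting the library map them to the platform.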

On Itanium LoadAcquire needs to actually generate a "ld.acq" which is a bit more expensive than a relaxed load, but not nearly as expensive as a memory fence. On PowerPC an implementation might be :


Load Acquire     ld; cmp; bc; isync
Store Release     lwsync; st

While I'm on this topic I should mention that the x86 has this super magic where loads can actually be executed out of order, but they are still retired *as if* they went in order. Because of this a lot of people incorrectly say "the x86 can reorder loads". Well, yes, and no. From a multi-threaded programmer's point of view the loads are not semantically reordered; however, in terms of how the processor actually executes them, yes, they do go through the reorder buffer and everything. In order to make this work, the x86 receives notifications whenever a cache line changes, and it invalidates loads that were speculatively done in the wrong order and causes them to be redone in the right order. The x86 reorder buffer can also reorder loads against stores when they do not overlap. Again this usually does not affect semantic program flow (note that LoadAcquire and StoreRelease don't say anything about the #StoreLoad barrier).

Okay this has gotten long enough so we'll continue in the next part.

02-24-09 - Low Level Threading - Part 4.0

Basic lock-free algorithms

We want to write some lock free algorithms to mutate possibly shared data. (Obviously the best possible lock free algorithm is to work on thread local data; hey it's lock free! the next best lock free method is to work on shared read-only data; hey you can read it without locking; so we assume you've done that as much as possible and what remains is the threads need to write something shared). The general concept that we will follow in all cases is the idea of having a constraint on the data structure to preserve its validity, and then doing operations that keep the data structure valid every step of the way.

Other threads may see the changes you are making at any point of your flow, so if you mutate the data in some way that messes up the data structure, and then fix it on the next line, they might get the bad stuff. So every single operation needs to atomically maintain the validity of the data structure.

So for example if data is something stupid like Data { int x; int must_be_negative_x; } , and you write code like :

Data.x = 7;
// (*)
Data.must_be_negative_x = -7;

At the (*) you can get swapped and Data is in an invalid state. (Some consumer is sitting around doing int zero = Data.x + Data.must_be_negative_x .)
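For a tiny structure like this, one escape hatch (my sketch, not from the text) is to pack the entire invariant into a single atomically-addressable word, so there is simply no in-between state for any consumer to observe:

```cpp
#include <atomic>
#include <cstdint>

// pack {x, must_be_negative_x} into one 64-bit word :
// high 32 bits = x , low 32 bits = -x
static std::atomic<uint64_t> g_data(0);

void SetData(int32_t x)
{
    uint64_t packed = ((uint64_t)(uint32_t)x << 32) | (uint32_t)(-x);
    // one atomic store : no thread can ever see a half-updated pair
    g_data.store(packed, std::memory_order_release);
}

int ReadSum() // the consumer's invariant check : x + must_be_negative_x == 0
{
    uint64_t packed = g_data.load(std::memory_order_acquire);
    int32_t x  = (int32_t)(packed >> 32);
    int32_t nx = (int32_t)(packed & 0xFFFFFFFFu);
    return x + nx; // always 0, no matter when we load
}
```

This only works while the whole structure fits in a machine-atomic word, which is exactly the limitation the rest of this section deals with.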

One way to do this that's most obvious is with large atomic operations that swap the data structure from one good state to another. If you pretend for a moment that we can work on arbitrarily large objects, we could just copy the whole data structure, change it to the new state, and then write it back :


Data localData = LoadAcquire( &g_sharedData );
Data newData = localData;
.. mutate newData ...
StoreRelease( &g_sharedData, newData );

Note that using Acquire/Release here ensures that anything we do in "mutate" is properly scoped. This correctly takes sharedData from old to new in one atomic step so that it stays consistent. However, something else bad can happen. While we were working on it, somebody else may have put their change in too. We don't want to just stomp on their change with our Store. The solution is to use CAS and retry :

Data localData, newData;
do
{
    localData = LoadAcquire( &g_sharedData );
    newData = localData;
    .. mutate newData ...
} while ( ! CAS_Release( &g_sharedData, localData, newData ) );

Here we take a local copy, make some changes to bring it to a new state, and then we try to publish it to sharedData. However, when we publish, we check to make sure that sharedData is what we saw when we started - it must be equal to localData or we will not write it. If someone else changed it while we were working, we go back to the start and try again. (publish means make visible to other threads; at the time that you publish you must ensure that what you have written is in a consistent valid visible state for other threads).

This is a "change-cas-retry" loop and it's a very common pattern in lock free programming. It ensures that we change data atomically so it is made visible as a unit, and also that all threads that are changing Data get to make their changes public - nobody blindly stomps on someone else's change without looking at it.
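Here's the same change-CAS-retry shape on a real word-sized value, sketched with C++11 compare_exchange (mine; note that compare_exchange_weak conveniently reloads the observed value into localVal for you when it fails):

```cpp
#include <atomic>
#include <thread>
#include <vector>

static std::atomic<int> g_shared(0);

void LockFreeAdd(int delta)
{
    int localVal = g_shared.load(std::memory_order_relaxed);
    int newVal;
    do {
        newVal = localVal + delta; // "mutate" the local copy
        // publish : succeeds only if g_shared still equals localVal;
        // on failure, localVal is updated to the current value and we retry
    } while (!g_shared.compare_exchange_weak(localVal, newVal,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
}

int run_cas_retry_demo()
{
    g_shared.store(0);
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; t++)
        threads.emplace_back([]{ for (int i = 0; i < 1000; i++) LockFreeAdd(1); });
    for (auto & th : threads) th.join();
    return g_shared.load(); // every increment got published; nobody was stomped
}
```

Each failed CAS corresponds to exactly one successful publish by some other thread, which is the "nobody blindly stomps" property.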

Now, as handy as the CAS-retry loop is, it's not really great lock free programming. If you look at it a second you can see that it's basically the same as a spin lock mutex. If there is no contention, then the CAS-retry steps right through just like linear code and the CAS succeeds and we're done. If there is contention, someone else wrote data, we loop and try again until there is no contention. Compare to this :


int mutex = 0;
Data g_sharedData;
    
    while( ! CAS_Acquire(&mutex,0,1) )
    {
    }
    Data localData = LoadAcquire( &g_sharedData );
    .. mutate localData ...
    StoreRelease(&g_sharedData,localData);
    StoreRelease(&mutex,0);

Again we do a CAS, if there's contention we loop, once there's no contention we run through the linear code and publish. The CAS-retry loop is maybe slightly faster than the simple mutex, but if each thread actually gets a hardware core and they are both running, then the difference is very small. The big advantage of the CAS-retry method over a plain old mutex comes from the fact that applications and operating systems are very variable.

Consider two threads that are fighting over g_sharedData, both trying to write it. Thread A is in the middle of working on sharedData when suddenly it gets stopped. There are various reasons it might get stopped, maybe someone suspended its thread, or maybe it just got swapped out to run another thread on that core, or maybe it had some OS kernel problem that stopped it (like it crashed). (BTW I'm assuming that you know you should not be doing things like File IO inside a mutex - in fact you should avoid any OS calls inside a mutex or anything that might stall or swap your thread; always hold mutexes for very short simple operations only). In any case it is stopped right in the middle of working on sharedData. If we used the simple mutex, then thread A is now blocking all other threads from working on sharedData. Our app is halted until thread A can run again. In the CAS-retry method, thread B can go ahead and do its work - it will see no contention because thread A is no longer changing sharedData.

One of the tricky things we have to deal with is that we have threads that are running on real separate hardware cores so they are stepping along at the same time, and if your locked area is short they will not ever stall on each other for long. But we also have threads that are running on cores with other threads, and those threads do NOT run at the same time, so one of them can appear to be frozen for a very long time when it's swapped out while the other runs. (there are a lot of other ways that these scenarios are quite different; in general we have to handle both and wind up making compromises; for example with a real hardware thread per core there's no point in ever sleeping at all, with lots of threads on one hardware core you have the exact opposite issue - spinning at all is pointless, and lots of micro work can cause thrashing such that you spend almost all your time in thread switches, and things like backoff become crucial).

If you like, you can imagine that a CAS-retry is a lot like a spin-lock which is only held while the contended item is actually being worked on ! BTW in both cases here you don't need to sit and spin when there is contention. You can see there's contention and just abandon the change and do something else. It's somewhat rare that that is actually practical. (this is like TryEnterCriticalSection for those from that world). CAS-retry only spins as many times as the number of times that someone else updates the data.
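The "abandon on contention" variant is just a single CAS attempt with no loop. A sketch (mine, with hypothetical names):

```cpp
#include <atomic>

static std::atomic<int> g_value(0);

// Try ONCE to move g_value from oldVal to newVal.
// Returns true on success; false means contention - the caller is free to
// abandon the change and go do something else (like TryEnterCriticalSection).
bool TryPublish(int oldVal, int newVal)
{
    int expected = oldVal;
    return g_value.compare_exchange_strong(expected, newVal,
                                           std::memory_order_release,
                                           std::memory_order_relaxed);
}
```

compare_exchange_strong (not weak) is the right choice here, since a spurious failure would be indistinguishable from real contention and we are not looping.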

Now of course in the real world we can't atomically change a large data object, usually we are limited to 32 or 64 bits of atomic data. So the solution of course is to use a pointer and to CAS in the new pointer. This is pretty trivial but you do have to keep in mind that you are consuming & publishing a pointer, and that once you release it, you no longer own it :


for(;;)
{
const Data * localData = LoadAcquire( &g_sharedData );
// localData is a pointer to *shared* read-only data !
// I do not own it, it is owned by g_sharedData

Data * newData = new Data;
.. mutate newData ...
// I make my own local data object to prepare the next revision
// I can do anything I want in here on newData 

newData->stuff = Compute( localData );

// now I am publishing newData -
//    if the CAS succeeds the newData instantly becomes shared
//  and anyone else can be reading it
// the CAS here must have Release memory order so that the memory ops
// I just did in mutate are made public before the change to g_sharedData

if ( CAS_Release( &g_sharedData, localData, newData ) )
    break;

// newData was NOT published so I am still the exclusive owner here
delete newData;

}

(obviously you wouldn't actually new/delete each time, this is just illustrating the flow of ownership).
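A sketch of the same ownership flow in C++11 terms (mine; note the deliberately unsolved problem on the success path - safely deleting the *old* object requires a reclamation scheme like hazard pointers or RCU, which is a topic of its own):

```cpp
#include <atomic>

struct Data { int stuff; };

static std::atomic<Data *> g_sharedData(nullptr);

void UpdateShared(int increment)
{
    for (;;)
    {
        // localData points at *shared* read-only data; we do not own it
        Data * localData = g_sharedData.load(std::memory_order_acquire);

        // prepare the next revision privately; nobody else can see newData yet
        Data * newData = new Data;
        newData->stuff = (localData ? localData->stuff : 0) + increment;

        // publish with release order so the writes to newData are visible
        // before the pointer swap is
        if (g_sharedData.compare_exchange_strong(localData, newData,
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed))
        {
            // published : newData is now shared. localData is now garbage,
            // but we can NOT just delete it - another reader may still hold
            // it. Real code needs hazard pointers / RCU / refcounting here.
            break;
        }

        delete newData; // not published, still exclusively ours
    }
}
```

(As above: you wouldn't really new/delete per attempt; this is just the ownership flow.)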

Next time we'll use CAS-retry to make a simple LIFO stack.

Oh, let me also take a second to introduce the idea of the "Abort". I find it easier to talk about many lock-free algorithms this way. A lot of the time in LF code we go to change or check something, hoping to find something out, but we see there was contention (the values are changing and we can't get a good result), so we just Abort and retry.

So for example the C++0x CAS would be :


LF_Result CAS(T * pVar, T & oldVal, T newVal)
{
    if ( ! trylock(pVar) )
        return Abort;
    
    bool match = (*pVar == oldVal);
    oldVal = *pVar;
    if ( match )
        *pVar = newVal;    

    unlock(pVar);

    return match ? Success : Failure;
}

We now return a "LF_Result" which is {Abort, Success, Failure}. Abort means ignore the results and just retry because there was contention. One way to think about a lot of Lock Free algorithms is to imagine that your old deterministic code which only had two boolean possibilities now has 3. (also "void" deterministic code that had no return value at all now has {Abort,Success} options).

As I mentioned before the usual recourse when you get an Abort is just to retry, and you are guaranteed to only get Aborts as long as there is active contention. But you could also take the Abort and do something else entirely.
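Interestingly, C++0x's compare_exchange_weak already has an Abort baked in: it is allowed to fail spuriously even when the values matched, which is exactly why it is always wrapped in a retry loop. A sketch (mine) of mapping it onto the three-way LF_Result:

```cpp
#include <atomic>

enum LF_Result { LF_Abort, LF_Success, LF_Failure };

// Three-way CAS in the style above, built on a C++11 atomic (my sketch).
// A spurious compare_exchange_weak failure - the values matched but the op
// failed anyway - maps to Abort; a genuine value mismatch maps to Failure.
LF_Result TriCAS(std::atomic<int> * pVar, int & oldVal, int newVal)
{
    int expected = oldVal;
    if (pVar->compare_exchange_weak(expected, newVal))
        return LF_Success;
    if (expected == oldVal)
        return LF_Abort;   // matched but failed : contention / spurious failure
    oldVal = expected;     // report the value we actually saw
    return LF_Failure;
}
```

On x86 compare_exchange_weak doesn't actually fail spuriously, but on LL/SC architectures (PowerPC, ARM) it can, so portable code must treat Abort as a real outcome.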

2/23/2009

02-23-09 - Low Level Threading - Annotated Linkies

Part 1 : Background

Stuff about C++0x. I now believe that C++0x is the right way to talk about multi-threading. You require the basic type to be atomic (eg. fit in a machine word), and then you add specific memory order constraints to each operation - load, store, or exchange - that constrain the program to be correct. The goal in writing good lock-free code is to find the minimum necessary and sufficient memory constraints to make the program correct. The lighter the constraints you need to apply, the more efficient the code can be.

Multicores and Publication Safety - Bartosz Milewski's Programming Cafe
C++ atomics and memory ordering - Bartosz Milewski's Programming Cafe
Hans Boehm : Why atomics have integrated ordering constraints

Stuff on the basics of "volatile" and memory ordering and how to make the compiler do what you want. The summary is that "volatile" has changed meaning so much that you should probably just not use it at all. The main issues to worry about are : compiler reordering and what the compiler does with your code, memory ordering constraints (cache/bus locking and flushing constraints), CPU reordering of instructions and memory timing vs cpu instruction timing.

It Goes To Eleven The Joys of Compiler and Processor Reordering Why You Technically Need Read-Side Barriers
Kang Su's Blog : volatile, acquire/release, memory fences, and VC2005
Intel - Volatile : Almost Useless for Multi-Threaded Programming
Memory ordering, barriers, acquire and release semantics - niallryan.com
The Old New Thing Acquire and release sound like bass fishing terms, but they also apply to memory models
Dr. Dobb's volatile vs. volatile January 8, 2009
Acquire and Release Semantics
Memory barrier - Wikipedia, the free encyclopedia
Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8 Visual C++ Language Visual C++ MSDN Forums

The MSDN stuff is all good :

Understanding Low-Lock Techniques in Multithreaded Apps
Rules and Limitations for TLS
Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Synchronization and Multiprocessor Issues (Windows)
Concurrency What Every Dev Must Know About Multithreaded Apps
Break Free of Code Deadlocks in Critical Sections Under Windows
Under the Hood Reduce EXE and DLL Size with LIBCTINY.LIB
Multiple Approaches to Multithreaded Applications - Intel Software Network

I think that "cmpxchg" on x86 with no lock prefix is equivalent to a C++0x compare_exchange(memory_order_acq_rel) , as opposed to "lock cmpxchg" which is memory_order_seq_cst :

Boehm, Hans - locks on X86
x86 synchronization primitives

This next link I believe holds the secrets to the mysteries of the x86 load reordering. All loads on the x86 are "acquire", which means they appear to go in order. (an "acquire" load means that later loads cannot move before the acquire, but earlier ones could move after it - when all loads are acquire it means they all go in order). However, we all know the x86 can speculatively execute loads out of order. WTF !? Something is seriously wrong with those two facts. In fact, loads are executed out of order, but if writes cause any changes during that time, they are invalidated and the loads are reissued :

details of x86 load reorder

This post gives you some idea of what the processors have to do to enforce the appearance of in-order loads. Also note : "Any memory access that falls within a single cache line is guaranteed to be atomic". (x86 specific)

Gustavo Duarte talks about Cache details

General note : I find it very valuable to understand what is going on under the hood. HOWEVER you should not write code that relies on the details of one processor or OS or compiler implementation. The code should be descriptive about the constraints that it needs, then you can have library routines which enforce that model for the specific OS/compiler/CPU.

Some general articles about how lock-free and CAS are not a panacea. It's important to remember that a CAS (any "lock" instruction) is a bus lock which is very slow - not just for your processor, but for all other processors looking at that cache line! In even moderately contended algorithms, the cost of CAS for N processors goes up like O(N). That means if you tried to implement a huge multiprocessor machine where people used Interlocked ops to synchronize, as you added more and more processors eventually the whole thing would be running in lockstep with every processor waiting on the bus all the time. In fact, a CAS-spin-retry is really fundamentally not much different than a short Critical Section.

Joe Duffy's Weblog - CAS Performance kills you
Dr. Dobb's Maximize Locality, Minimize Contention May 23, 2008
Dr. Dobb's Lock-Free Code A False Sense of Security September 8, 2008
What is the real cost of LOCK on x86 ?

Good article about how Windows SLIST deals with 64-bit pointers and the requirement to do CAS with only a 64-bit atomic in early x64 chips. With 32-bit pointers we often make a {pointer,counter} object in 64 bits and use cmpxchg8b. New chips have cmpxchg16b. Sadly there was a period when Intel made 64-bit chips with no cmpxchg16b.

Alex Ionescu's Blog - Blog Archive - Behind Windows x64's 44-bit Virtual Memory Addressing Limit

A bunch of Herb's junk : Herb develops a lock free queue (FIFO) which is basically just a SPSC queue, and then makes an MPMC variant by putting a lock on each end. This is a general technique - any SP or SC structure can be changed to MP or MC by adding a "producer lock" or "consumer lock". SP or SC work by knowing that only one owner is touching that end at a time. That "owner" can either be a specific thread, or it can be designated as the holder of the lock.

Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007
Dr. Dobb's Lock-Free Queues July 1, 2008
Dr. Dobb's Writing Lock-Free Code A Corrected Queue September 29, 2008
Dr. Dobb's Measuring Parallel Performance Optimizing a Concurrent Queue December 1, 2008

Big collections of articles :

Dr. Dobb's Go Parallel Features
Computer Laboratory -
AMD Developer Central - Articles & Whitepapers
Sun Scalable Synchronization Research Group
Sun Labs Scalable Synchronization Research Group Publications

Worker pools and queues :

Multithreaded Producer-Consumer The Easy Way
Larry Osterman's WebLog So you need a worker thread pool...
Re Non-blocking queue for the 2 threads
Joe Duffy's WSQ 1
Joe Duffy's WSQ 2
Joe Duffy's WSQ 3
Creating a thread safe producer consumer queue in C++ without using locks - Cluebat-man to the rescue

Some shit about singletons and such :

Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004
Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN
Double-checked locking - Wikipedia, the free encyclopedia
The Old New Thing C++ scoped static initialization is not thread-safe, on purpose!
The Old New Thing An auto-reset event is just a stupid semaphore

Major lock-free code bases :

Paul E. McKenney Read-Copy Update (RCU)
Atomic Ptr Plus Project
TBB Home
Download details Parallel Extensions to .NET Framework 3.5 June 2008 CTP
Getting Started with the .NET Task Parallel Library Multi-Core Case Studies
Intel Performance Libraries - Intel Software Network
Intel Thread Checker 3.1 for Windows - Intel Software Network
Intel Thread Profiler 3.1 for Windows - Intel Software Network
Synchronization Algorithm Verificator for C++0x - Intel Software Network
transforming single-threaded allocators... - Intel Software Network

Some specific code samples (WARNING : OFTEN WRONG !! I have yet to find any reliable source of lock-free code on the net that you can just grab and use)

opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)
Kernel space Ticket spinlocks - LinuxWorld
CodeProject A simple Win32 readerswriters lock with reentrance. Free source code and programming help
CodeProject Single Producer Single Consumer, with no critical sections. Free source code and programming help
Lock-Free Linked List
SourceForge.net Spin-X Engine
Lock free double ended queue .mischief.mayhem.soap.
Single producersingle consumer bounded queue .mischief.mayhem.soap.
Lock-free, multi-threaded Mandelbrot .mischief.mayhem.soap.

ADDENDUM : Found some more goodies : Java Memory Model guide for compiler writers by Doug Lea. This is a very good overall discussion of memory models and how to implement them on various architectures.

Example implementation of C++0x on Power architecture by Paul McKenney.

fences in C++0x by Peter Dimov.

Joe Duffy on why critsecs need fences

Bartosz on the rare need for fences on x86


Part 2 : V'jukov and Thomasson

Now we get into the part that's actually useful. (note there are places where they "sketch" algorithms - obviously don't try to grab a sketch code snippet and stick it in your code and assume it works)

Any lock-free code that has not been tested with Relacy is basically worthless. Anybody can write lock-free code and it "seems to work" - it's shit, it's garbage, it's no help at all. The hard part is in the testing and absolutely being sure it's correct, just sketching up something that "looks right" can be done by any monkey.

Relacy Race Detector Google Groups

Dmitriy V'jukov who wrote Relacy also has lots of great posts around the net. The most valuable information we will find is by him and one other guy :

Chris Thomasson, author of AppCore : the code in AppCore is mostly ASM and x86 specific. Chris has lots of great writing on lock-free stuff around the net, some more links to him will follow. He believes that doing a lot of this in ASM is easier because you can see the exact ordering of ops and don't have to worry about what kind of madness the compiler is doing. I think he's right to some extent, I wind up spending half my time trying to ensure the compiler generates the right single instruction. For example his SPSC queue is incredibly simple and efficient.

AppCore A Portable High-Performance Thread Synchronization Library
vzoom refcount - mostly lock-free word-based atomic reference counting

Our two heroes often reside here :

Scalable Synchronization Algorithms Google Groups

Discussion with Thomasson where he describes how the Fober LIFO is broken and how to fix it : (also contains his x86 ASM for a LF LIFO and small mention of the fact that stack can touch freed memory, but in a benign way, so that just catching the access violation and making it a retry is a good fix) :

Solving the memory issue of lock-free stack

Thomasson sketch of work-requesting system : (unlike work-stealing , work-requesting lets you use a non-thread-safe stack for your own work items) :

VERY crude work-requesting environment... - comp.programming.threads Google Groups

Thomasson and V'jukov discuss the SPMC WSDQ of Hendler & Lev :

ABP stealing deque algorithm - comp.programming.threads Google Groups

Thomasson sketches a LF SPMC WSDQ (ignore all the junk posts at the top and scroll down to Thomasson and V'jukov) :

Single producer, one-shot work queue implementation, no lock on queue. - comp.programming.threads Google Groups

In this post Thomasson sketches the method of using an MPMC stack and just exchanging the head with NULL to grab the whole thing, then reversing it to make it FIFO instead of LIFO. This is a very efficient and simple way to do an MPSC FIFO. Also more interesting discussion by V'jukov at the bottom about garbage collection and such :

Wait free queue - comp.programming.threads Google Groups
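The stack-swap trick is simple enough to sketch (my code, not Thomasson's): producers CAS-push onto a LIFO head, and the single consumer exchanges the head with NULL and reverses what it grabbed to get FIFO order:

```cpp
#include <atomic>
#include <cstddef>

struct Node
{
    Node * m_next;
    int    m_value;
};

static std::atomic<Node *> g_head(nullptr);

// any thread : CAS-retry push (MP-safe)
void Push(Node * node)
{
    Node * localHead = g_head.load(std::memory_order_relaxed);
    do {
        node->m_next = localHead;
    } while (!g_head.compare_exchange_weak(localHead, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
}

// single consumer only : grab the whole stack in ONE exchange, then reverse it
Node * PopAllFIFO()
{
    Node * list = g_head.exchange(nullptr, std::memory_order_acquire);
    Node * reversed = nullptr;
    while (list)
    {
        Node * next = list->m_next;
        list->m_next = reversed;
        reversed = list;
        list = next;
    }
    return reversed; // now in FIFO (oldest-pushed-first) order
}
```

The reverse is O(n) but amortized O(1) per node, and the consumer pays zero atomic ops per node after the single exchange.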

A related thread - this starts off with the MPSC method using stack-swap, but then V'jukov drops a gem of a new MPSC method :

A Correct, Lock-free FIFO with a single CAS per complete putget cycle - Scalable Synchronization Algorithms Google Groups
V'jukov intrusive MPSC node FIFO here

Another WSDQ where Thomasson and V'jukov try to straighten things out :

Double ended queue - review - comp.programming.threads Google Groups

Again discussion starts off with mystification about the x86 details, and leads to Thomasson posting his SPSC and straightening out the clowns :

Memory barrier + volatile (C++) - comp.programming.threads Google Groups

V'jukov's very cool node allocator for FIFO queues that does no atomic ops at all ! (amortized). FIFO's are a very bad case for standard multi-threaded allocators like tcmalloc, because every single block is being allocated on one thread and freed on another, which requires them to do a cross-thread pass-back (or causes heavy slab contention).

Lock-free node allocator for fifo-queues - Scalable Synchronization Algorithms Google Groups

V'jukov's simple example of an SPSC queue :

Single-Producer/Single-Consumer Queue - Intel Software Network

V'jukov talks about CAS performance and issues of MPMC and other structures, mentions SList exception catching technique : (go to page 2)

Concurrent stack - Intel Software Network

Finally : Thomasson's slab allocator

Dmitriy sometimes uses the opposite meaning of "head" and "tail" from how I think and it tilts the hell out of me. I always think of "head" as where you push, and you pop from the tail. So stacks only have a head (you push & pop the head), while FIFOs enqueue on the head and dequeue from the tail; he usually does the opposite. That is, in a singly-linked list I think of the links as pointing from the head towards the tail, so if you walk the links you get from the head to the tail.

ADDENDUM : Found some more goodies :

Thomasson describes vZOOM architecture

V'jukov Memory Passing Queue good for thread caching allocators passing blocks back to the main thread for example, but also other things.

2/22/2009

02-22-09 - Linkies

If you're not already following Drew and Ryan's Void Star Blog , get over and check out the awesome new video and Drew's nice render layer writeup.

Maciej posted this but it's so fucking great it's worth reposting : Gustavo Duarte writes about hardware and kernels with awesome diagrams.

Game Tunnel does a Monthly Round-Up of Indie Games that's not bad.

Peter Lindstrom - compression for graphics
Pedro V. Sander - parameterization, simplification, meshes, etc.
Very good.

Ishani.org - 2d level editor tools
This is a free 2d tile map editor for games (source code). I found it cuz I was searching for "Alien Breed" cuz the Void Star space game kind of vaguely reminds me of it.

compression.ru lossless image papers

bomb visual music
"Bomb" is one of the first interactive visual composition "instruments".

The Durrr Challenge is on. It's still early so not terribly exciting, but you can follow it here :
FT - durr challenge
Durrrr Challenge

2/19/2009

02-19-09 - Thread Safety Levels

I am trying to mark up my code with explicit comments about 3 levels of thread safety. I think this is a good concept that I haven't really seen discussed much :

Completely Thread-Safe (CTS) :

This function touches only local variables, objects through locks, objects which are intentionally & safely lock-free, TLS variables, and other stuff that is totally thread safe.

Object Thread-Safe (OTS) :

These functions can touch anything that is CTS, and also can touch any objects passed into them. If the objects passed into them are completely owned by the caller, then they are CTS. Class member functions should typically be OTS for example.

Not Thread-Safe (NTS) :

These functions touch globals or something and are just not thread safe. You must ensure they are run in sequential order.

So, for example, most of my init & shutdown code is NTS. I assume that you do inits, then start threads, then kill threads, then do shutdowns.

It would be really awesome if I could mark up functions in C++ with extra constraints. Then I could stick "CTS" on a function decl and the compiler could tell me if it does anything that doesn't comply with the CTS constraint. OTS can call CTS. NTS can call anything. If CTS calls NTS, it's a compile error.

Another thing that would sure be handy is a way to find all statics and globals. Fucking C++ has reused the reserved words so much, there's no word at all for globals, and "static" is used for so many things it's not a very useful search.

BTW another level that might be needed is Single-Thread-Safe :

These functions are thread-safe only if they are always called from the same thread. That does not necessarily mean that they are only touching data which that thread exclusively owns - they may be touching shared data, but in a careful way that works only if only one thread is writing the data. One obvious example is the lock-free single-producer data structures. They are not really CTS because if you call Push on them from any thread but the owner you are borked.

02-19-09 - PNG Sucks

If you're a game developer and you want to ship your game with compressed lossless images (which you should want - it makes the install faster and smaller, and you take less space on disk, and your game loads faster because decompression is faster than disk IO - it's win win win) - PNG is a natural choice. The good thing about PNG is it's widely supported now, and it's relatively simple (though still ungodly complex for what it does). The bad thing is it really sucks for compression in silly ways.

One problem is that the actual LZ is Zip. Zip is really bad old technology. I know they used it cuz its own, blah blah. Anyway. You are probably already compressing your game content with some kind of LZ (you should be). Maybe CAB or LZMA. Both of those knock the living piss out of Zip. But if you Zip the image first (via PNG) then CAB or LZMA can't work on it. Instead, just leave your image alone - leave it in BMP - and let the CAB/LZMA/whatever do its LZ !

The only thing missing is the "DPCM" (silly name for doing delta from neighbors). Again PNG has some pretty awful choices for DPCM filters.

We could make a much better DPCM using a larger neighborhood, but for speed and simplicity we will only consider the 1-ring (like PNG), that is N, W, NW :


NW  N
W   ?

There are only two logical symmetric values that we can make from our neighborhood :

grad = N + W - NW;
avg  = (N + W)/2;

"grad" is the gradient predictor, the simple planar fit. The set of all symmetric linear predictors can thus be specified by one parameter :

pred(t) = avg * t + grad * (1-t)

PNG provides pred(t) for t = 0 and 1, but 0.25, 0.5 and 0.75 are all very good values too (and can be implemented with shifts), and in fact they beat 0 and 1 very handily on most images.
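For example, t = 0.5 needs nothing but adds and shifts. A sketch (PredHalf is my hypothetical name, this is not a PNG filter) :

```cpp
// pred(t) = avg*t + grad*(1-t) ; at t = 0.5 it's just the midpoint of
// the two, so the whole thing stays integer-shift-only :
static inline int PredHalf(int N, int W, int NW)
{
    int grad = N + W - NW;     // t = 0 : planar fit
    int avg  = (N + W) >> 1;   // t = 1 : average
    return (grad + avg) >> 1;  // t = 0.5
}
```

t = 0.25 and 0.75 work the same way, just with one more add-and-shift to weight the two terms 3:1.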

Note that "grad" is actually a perfect predictor for horizontal and vertical stripes. That is :


A A
B ?

predicts B

B A
B ?

predicts A

The only time you would actually want a "N" or "W" pure predictor is in a weird case where pixels were correlated in one direction but not the other. For example if you had an image made from interlacing many images such that each horizontal line came from a different source image, then the "W" predictor would win. I have yet to find a single image where pred(t) is not best or very close to best.

Note that you could also obviously do an adaptive DPCM predictor in the style of CALIC that looks at the neighborhood edges and gradients and ranges and chooses different predictors. But that's an awful lot of complexity and would make it hard to put our simple DPCM into MMX or whatever.

So, this is what I'm roughly putting in Oodle for lossless texture compression. I just run a DPCM on the texture samples to turn them into deltas, and then I just jam the texture into my file stream, which LZ compresses everything. It's very easy to do, it's very fast, and it handily beats PNG.
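A minimal sketch of what that forward pass might look like (this is my hypothetical DpcmForward on 8-bit grayscale with the grad predictor, not the actual Oodle code; residuals wrap mod 256 so the inverse pass just adds the prediction back) :

```cpp
#include <vector>
#include <cstdint>

// Forward DPCM : replace each sample with its delta from the "grad"
// predictor (N + W - NW). Out-of-image neighbors are treated as 0.
// The residual stream is what you hand to the LZ stage.
std::vector<uint8_t> DpcmForward(const std::vector<uint8_t> & img, int width, int height)
{
    std::vector<uint8_t> out(img.size());
    for (int y = 0; y < height; ++y)
    for (int x = 0; x < width;  ++x)
    {
        int W  = (x > 0)          ? img[y*width + (x-1)]     : 0;
        int N  = (y > 0)          ? img[(y-1)*width + x]     : 0;
        int NW = (x > 0 && y > 0) ? img[(y-1)*width + (x-1)] : 0;
        int pred = N + W - NW;  // planar gradient predictor
        // residual wraps mod 256 ; decoder does (residual + pred) mod 256
        out[y*width + x] = (uint8_t)(img[y*width + x] - pred);
    }
    return out;
}
```

Note that on any locally planar region (a linear ramp in x and y) the interior residuals come out exactly zero, which is why the deltas compress so much better than the raw samples.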

ASIDE : this is only tangentially related, but the whole idea of the "planar" gradient predictor is very useful for predicting and extrapolating images. I used something similar in Galaxy3 to do normal map extrapolation (more about this in the next post). Obviously you can go past 1st-derivative prediction and do 2nd-derivative quadratic gradient prediction. I recently discovered a really sweet paper by Peter Lindstrom on Spectral Predictors which tabulates the best continuous predictor for various support shapes. (in image compression you always have the simple L-shape support because you are scanning in that order, but in other applications like normal map extrapolation you can have any possible support shape).

ADDENDUM : I should note there are like a million ways to beat this. That's not the point. The point is that it's *so* easy to do this, and it's very fast, so why not. Some obvious ways to beat it :

Divide the image into 32x32 tiles and pick the best DPCM on each tile. To pick a DPCM, do the least-squares best fit to find the optimal linear predictor. But least-squares minimizes MSE, which is not what you want - you want the minimum-rate predictor, so use MRP. Use more than a 1-ring neighborhood. etc. etc. (BTW none of this really changes the complexity of the decoder much, but you are definitely getting into the long compression tail of diminishing returns).

02-19-09 - Two Code Gems - Appendium

BTW similar to the array size thing is the Zero'er macro. It's very common to write a macro that will zero a struct out, like :

#define ZERO(ptr)  memset(ptr,0,sizeof(*ptr))

BITMAPINFOHEADER bih;
ZERO(&bih);

However, this ZERO() macro is very easy to use in catastrophically bad ways. (and no I'm not talking about invalidating C++ objects or anything - I'm assuming that you are using it only on things that *can* be zeroed, you just accidentally use it wrong).

Exercise : what happens in this code ?


DWORD test1[4] = { 1,1,1,1 };
char * test2[4] = { "","","","" };
char * test3 = "";

ZERO(test1);
ZERO(test2);
ZERO(test3);

(don't use a compiler to cheat).

And how do I write a better ZERO macro ? (I don't actually know the ideal answer to this question, so it's not entirely condescending rhetorical douchebaggery).
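One candidate answer (my sketch, not claimed ideal) : take the object by reference instead of by pointer, so arrays don't silently decay to pointers and sizeof sees the whole thing :

```cpp
#include <cstring>

// Zero by reference : T deduces to the full array or struct type, so
// sizeof(T) is the whole object, not the pointed-to element.
template <typename T>
void Zero(T & obj)
{
    memset(&obj, 0, sizeof(T));
}

// Zero(test1);  // T = DWORD[4] : zeroes all 16 bytes, not just one DWORD
// Zero(test2);  // T = char*[4] : zeroes all four pointers
// Zero(test3);  // T = char*    : zeroes the pointer variable itself,
//               // not the string literal it points at - at least not UB
```

It still can't catch semantic misuse (maybe you *wanted* to zero what test3 points at), but the worst failure mode changes from silent memory stomping to merely doing less than you hoped.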
