5/31/2012

05-31-12 - On C++ Atomic Fences Part 2

Last post we talked about what C++0x fences are. Now we'll look at what they are not.

(NOTE : I am talking about what a C++0x fence is guaranteed to be by the standard; at the moment we will not concern ourselves with the fact that at the moment a C++0x fence always actually issues a CPU memory barrier which is somewhat stronger than what C++0x promises; I have no doubt that CPUs and compilers will change in the future to be more aggressive about allowing relaxed memory ordering).

The C++0x fence is not visible to other threads unless they specifically schedule against the fence. That maybe sounds obvious if you are thinking in terms of the C++0x definitions, but it's not true of a real CPU fence.

A very common example in old-school code is to use a CPU barrier for publication. Something like this :


Set up stuff
Barrier
Publish stuff

other threads :

if stuff is published
  then reading stuff should see set up

this does *not* work if "Barrier" is a C++0x fence. In particular we can construct this simple example :

atomic<int> x; // stuff
atomic<int> y; // publication flag

x & y initially 0

thread 0 :

    // set up stuff :
    x($).store(1,mo_relaxed);
    // barrier :
    std::atomic_thread_fence(mo_seq_cst,$);
    // publish :
    y($).store(1,mo_relaxed);

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
      bo.yield($);

    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

this does not work (xx can be zero). To make it work in C++0x you need something that can synchronize with the fence, for example this would work :

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
      bo.yield($);

    // added -
    std::atomic_thread_fence(mo_acquire,$);

    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

Why does it work on real CPU's then? (assuming the "relaxed" loads are at least C++ volatile so they go to memory and all that, but not otherwise ordered). On all current real CPU's a memory barrier sends a message to all other CPU's which creates a sync point for all of them (not exactly true, but effectively true). When thread 1 sees that y is non-zero, then the store to y on thread 0 must have happened, which means the barrier must have happened, so x must be set up, and our load of x occurs after the barrier, so it must see the set up value. That is, the barrier forms a sync point on *all* threads, not just the originator, and you don't necessarily need your own fence to tack into that sync, all you need to do is have a way of connecting a "happens before/after" to it. In this case we can say :

thread 1 reads x after
thread 1 loads y with value == 1 after
thread 0 stores y with value == 1 after
thread 0 does a barrier after
thread 0 stores x

(aside : this is of course a terrible way to do publication in the modern world, you should just use a store-release and load-acquire pair to do publish and consume; alternatively if you publish a pointer, then you don't need any synchronization at all, because causality and load-consume takes care of it for you).

Okay. We can see the exact same thing in a more complex example. This is Dmitriy's example Peterson Mutex . Here's my test harness :


struct test7 : rl::test_suite<test7, 2>
{
    // the mutex :
    std::atomic<int>    m_flag[2];
    std::atomic<int>    m_turn;
    // something protected by the mutex :
    std::atomic<int>    m_data;

    void before()
    {
        m_flag[0]($).store(0);
        m_flag[1]($).store(0);
        m_turn($).store(0);
        m_data($).store(0);
    }
    
    void after()
    {
        int d = m_data($).load();
        RL_ASSERT( d == 42 );
    }
    
    void lock(int tidx) 
    {
        int other = tidx^1;
    
        m_flag[tidx]($).store(1,std::mo_relaxed);
        m_turn($).store(other,std::mo_release);

        // ** Barrier here **       
        
        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }       
        
    }
    
    void unlock(unsigned tidx) 
    {
        m_flag[tidx]($).store(0,std::mo_release);
    }
    
    void thread(unsigned tidx) 
    {
        lock(tidx);

        // do m_data += 7; m_data *= 2;
        int d = m_data($).load(std::mo_relaxed);
        m_data($).store(d+7,std::mo_relaxed);
        d = m_data($).load(std::mo_relaxed);
        m_data($).store(d*2,std::mo_relaxed);
        
        unlock(tidx);
    }
};

This is an example where if "Barrier" is an actual CPU barrier, this code works. But if "Barrier" acts like a C++0x seq_cst fence (and no more), then it doesn't work. (it doesn't work in the sense that it doesn't actually provide mutual exclusion, eg. the code doesn't do what you expect it to).

The failure is the same kind of thing as the first trivial example; all current CPU's have ordering relations that are stronger than C++0x. In particular the bad execution case that's possible in C++0x (when barrier is a seq_cst fence) goes like this :


thread 0 : starts lock
thread 0 : flag[0] = 1
thread 0 : turn = 1

thread 1 : starts lock
thread 1 : flag[1] = 1
thread 1 : turn = 0 ; (overwrites the previous turn=1)
thread 1 : barrier
thread 1 : load flag[0] ; sees 0 - old value (*!)

thread 0 : barrier
thread 0 : load flag[1] ; sees 1
thread 0 : load turn ; sees 0

(*!) is the problem point. For the code to work we must load a 1 there (which thread 0 set already). On a normal CPU you could say :

load flag[0] is after
barrier, which is after
store to turn = 0, which is after
store to turn = 1, which is after
store flag[0] = 1

therefore load flag[0] must see a 1. (I know the store to turn of 0 is after the store of 1 because the later load of turn on thread 0 sees a 0).

So part of the problem is C++0x doesn't count stores in the linear "modification order" of an atomic object. So the easy fix to ensure the "is after" relationship above actually happens is to change the store to turn into an RMW :


    void lock(int tidx) 
    {
        int other = tidx^1;
    
        m_flag[tidx]($).store(1,std::mo_relaxed);

        m_turn($).exchange(other,std::mo_acq_rel); // changed to RMW
        
        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }       
        
    }

and that works (it's the same thing described here : Peterson's lock with C++0x atomics - Just Software Solutions among other places).

Some links on the topic :

Subtle difference between C++0x MM and other MMs
Subtle difference between C++0x MM and other MMs - Page 3
stdatomic_thread_fence - cppreference.com
Relacy finds fence bugs in spsc
Re questions about memory_order_seq_cst fence 2
Re questions about memory_order_seq_cst fence 1
Implementing Dekker's algorithm with Fences Just Software Solutions - Custom Software Development and Website Development in
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 2
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 1

5/30/2012

05-30-12 - On C++ Atomic Fences

C++0x's atomic_thread_fence is weird. Preshing asked some questions and pointed out some errors in this blog post which has got me to look into it again.

(caveat : I'm no standards-reading afficionado, and I find the C++0x rules very odd, so this is as much me learning out loud as anything).

I'm going to do this post a bit backwards; first some random notes.

1. I never use C++0x fences. In all the lockfree atomic code I've written I've never used them or wanted them. I put the memory ordering on the operations; it's easier to understand and usually more efficient (because doing something like making an exchange be acq_rel is a no-op on most platforms, exchange is already acq_rel, whereas adding another fence requires another hard sync to be issued because compilers are not yet sophisticated enough to merge atomic ordering operations).

The only case I have ever seen where a fence is better than putting the constraint on the operation is for the optimization of doing a relaxed op to detect that you need the ordering. Something like :


if ( atomic_ticket.load( mo_relaxed ) == me )
{
  std::atomic_thread_fence( mo_acquire );  // make previous load act like acquire
  ... do stuff on my ticket ...
}

is faster than :

if ( atomic_ticket.load( mo_acquire ) == me )
{
  ... do stuff on my ticket ...
}

this can be used for example with a ref counting destructor; you don't actually need the "acquire" until the refs go to zero. Which brings us to the next note :

2. The way to understand C++0x fences is to forget everything you know about CPU's, caches, what the CPU memory fence instructions do, any connotation in your head about what "barrier" or "fence" means. The people who are most confused about it are the ones who had some background writing lockfree assembly in the pre-C++0x days, because C++0x fences are really not what you are used to.

What C++0x fences really do is provide more sync points through which other C++0x atomic ops can create "happens before" relationships. You can heuristically think of them as modifying the neighboring ops, not as being an independent operation themselves. In particular :


An "acquire" fence can make a preceding load act like a load_acquire
A "release" fence can make a following store act like a store_release

(an acq_rel fence obviously does both)

A "seq_cst" fence provides an entry in the program total order ("S")
  then preceding loads & following stores can be located relative to that point in the order S
  (eg. either "happens before" or "happens after")

Actually the fences are rather like the old-school way of marking up memory ordering constraints. eg. instead of :

x.load( mo_acquire );  // #LoadLoad follows acquire load

you used to write :

x.load();
#LoadLoad

which is more like the fence method in C++0x :

x.load( mo_relaxed );
fence( mo_acquire ); // #LoadLoad

Errkay so let's get into more specifics.


Section 29.8 of C++ doc N3337 :

29.8.2 : A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.


A is a release fence
X is an atomic op on M, X modifies M, after A

Y is an atomic op on M, Y reads the value set at X, before B
B is an acquire fence


is this :

    m_x and m_y initially zero
    
    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            std::atomic_thread_fence(std::mo_release,$); // "A"
            m_y($).store(1,std::mo_relaxed); // "X" , m_y is "M"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }
            // we just read a 1 from m_y // "Y"
            
            std::atomic_thread_fence(std::mo_acquire,$); // "B"
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

Roughly what this says is the "release" ordering of a fence synchronizes with the "acquire" ordering of another fence if there is a shared variable ("M") that connects the two threads, after the release and before the acquire. Or, if you like, the release fence before the store to M makes the store act like a store-release, and the acquire fence after the load of M makes the load of M act like a load-acquire.

(just using a store-release and a load-acquire is the normal way of doing a publish/consume pattern like this)

29.8.3 : A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.


A is a release fence
X is an atomic op on M, X modifes M, after A

A synchronizes with B :

B is an atomic op on M, reads the value written by X


    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            std::atomic_thread_fence(std::mo_release,$);        // "A"
            m_y($).store(1,std::mo_relaxed);                    // "X" , m_y is "M"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }

            int y = m_y($).load(std::mo_acquire);  // "B" synchronizes with "A"
            RL_ASSERT( y == 1 );
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

This just says the same thing but that you can substitude the acquire fence for a load-acquire. That is, a release fence can synchronize with a load-acquire just as if it was a store-release.

29.8.4 : An atomic operation A that is a release operation on an atomic object M synchronizes with an acquire fence B if there exists some atomic operation X on M such that X is sequenced before B and reads the value written by A or a value written by any side effect in the release sequence headed by A.


A is an atomic release op on object M

X is an atomic op on M, before B, reads from A
B is an acquire fence


    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            m_y($).store(1,std::mo_release); // "A"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }
            // we just read a 1 from m_y // "X"
            
            std::atomic_thread_fence(std::mo_acquire,$); // "B"
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

Again the same thing, just saying an acquire fence can synchronize with a store-release.

That's about for interesting stuff in 29.8 , there's a bit more in 29.3 on fences :

29.3.5 : For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

A modifies M , before X
X is a seq_cst fence
B reads M , after X

Then B cannot see a value of M from before A. 
Well, duh. Note that of course this is only non-trivial when A and B are done on different threads. And B being "after X" usually means B is after something else in the total order S which you know to be after X.

This sort of says that a seq_cst fence acts as a #StoreLoad. Note however that both A and B must have "happens before/after" relationships with the seq_cst fence. If only one of them has that relation then it doesn't work. We'll revisit this in the next post when we talk about how seq_cst fence doesn't behave exactly as you think.

29.3.6 For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.

A modifies M , before X
X is a seq_cst fence
Y is a seq_cst fence, Y is after X
B reads M , after Y

Well, duh.

29.3.7 For atomic operations A and B on an atomic object M, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B occurs later than A in the modification order of M.

A is an RMW on M , before X
X is a seq_cst fence
Y is a seq_cst fence, Y is after X
B is an RMW on M , after Y

then B is after A on the RMW order of M.

There are two forms of very strong ordering guaranteed in C++0x. One is the total order (called "S") of seq_cst ops (aside : I believe that seq_cst fences count as independent entries in the list S, but I don't see anywhere that the standard actually says that). The other is the modification order of an atomic object. There is guaranteed to be a single consistent modification order to every atomic, though not all threads may see the same order, depending on how they view it. Furthermore, any Store without a read (not an RMW) clobbers the modification order, because it wipes out the history and makes it impossible to know what the order before the store was. But you can say something strong : as long as you only use RMW ops on a variable, then it has a single global order of changes, and every thread can only view those changes in order.

eg. say a bunch of threads are doing RMW Increment ops on a shared variable. Then the variable must go {1,2,3,...} in its timeline, and the threads that do it can only see an ordered subset of those values, eg. {1,3,4,8,...} but never {3,1,7,4,....}.

Anyhoo, 29.3.7 is just saying that the total order of seq_cst ops (S) and the total order of an RMW sequence can order against each other.

Some of the standard document points are so redundant, I'm surprised they don't have an entry for the converse :

X is a seq_cst fence, before A
A is an RMW on M
B is an RMW on M, after A
Y is a seq_cst fence, after B

then Y is after X in the total order S
which presumably is also true.

Okay, so next post we'll look at what the fence is *not*.

5/27/2012

05-27-12 - Prefer Push-Restore to Push-Pop

It's a standard paradigm to use a Stack for a subsystem which parallels the execution stack.

Some common examples : profilers that push/pop scopes, logging scopes such as setting tab depth or subsystem, mutex lock scopers, exception handler stacks, etc. Basically the model is at the start of some function you push something on your parallel stack, and pop on exit.

I've only recently realized that this is bad programming. The problem is it's redundant with the execution stack, and any time you have two separate subsystems which must be exactly in sync to work correctly, you have bad code. In particular, it's (unnecessarily) fragile.

To be concrete lets consider a "stack" that's just an integer counting the depth for something. You might naively maintain it like :


Func1()
{
  s_depth ++; // push

  ... other stuff ..
  Func2()

  s_depth --; // pop
}

Now, sophomoric programmers may already be thinking you could use a "scoper" class to do the push/pop for you; as long as you use the scoper that ensures that you do push/pops in pairs, and it eliminates one type of bug, which is accidentally returning without popping.

But really the scoper is just putting a band-aid on the problem. There is a whole other class of bugs that you cannot fix that way - what if Func2() is written wrong and has a mismatched push/pop ? It will fuck up the stack to you. The entire mechanism is fragile because it is vulnerable to inheriting mistakes in any child code.

Now, sophomoric programmers are thinking "of course it inherits mistakes from calling broken code", they think that's just the way code is. They think the way you make working programs is by finding every single bug in every possible execution path and fixing them all. They think fragile code that has to be used "just so" is fine, because they will ensure that the code is used in exactly the right way.

This is very wrong, and it's why so many programs suck so bad. It comes from a great arrogance that is very common in programmers from beginners to experts. It's the (untrue) belief that you can write bug free code. It's the belief that fragile systems without fail-safes and inherent self-protection are okay because you are so great you will use them correctly.

It becomes obvious if we add the invariant checks that test program correctness :


Func1()
{
  int depthBefore = s_depth;
  s_depth ++; // push

  ... other stuff ..
  Func2()

  s_depth --; // pop
  ASSERT( s_depth == depthBefore );
}

The assert checks that after the pop, the stack should be back to where it was before the push.

But if that's the requirement, why not just do that directly? It's much more robust, it's not sensitive to unmatched push/pops in the children. So instead of Push/Pop we should just do Push/Restore :


Func1()
{
  int depthBefore = s_depth++; // push

  ... other stuff ..
  Func2()

  s_depth = depthBefore; // restore
}

A few quick notes on this : 1. Obviously for super-robust code you should run both methods simultaneously and check that they match; in case of mismatch, perhaps prompt the user (footnote *1*), or just prefer the more robust, and 2. it's always a good idea to write out the asserts that check the invariants you believe to be true in the code. Often you will find that the assert is simpler than the code it was checking. 3. the code is more robust if you just go ahead and make the invariant true. eg. you often see something like :


int Func3(int x)
{
  ASSERT( x == 1 || x == 3 );
  int y = x * 4 + 12;
  ASSERT( y == 16 || y == 24 );
  return y;
}

well if you know the parameters and you know what the answer should be, then just fucking return that. If you require a simple invariant, then just fucking make it true. Don't assume it's true without testing it. Asserting is slightly better, but it's still fragile. Just make it true.

Anyhoo, our s_depth example is so trivial let's do a slightly more complex one, a profiler :


struct ProfBlock { tick_t start; tickt_t end; }
vector<ProfBlock> s_profile_vec;

void Profile_Push()
{
  s_profile_vec.push_back();
  s_profile_vec.back().start = tick();
}

void Profile_Pop()
{
  s_profile_vec.back().end = tick();
  // [start,end] for this block is now set, as is stack depth, store it somewhere permanent
  s_profile_vec.pop_back();
}

Func1()
{
  Profile_Push();

  ... other stuff ..
  Func2()

  Profile_Pop();
}

and as noted in the comment, I assume you actually do something with the block once it's recorded, but the details of that are not necessary here.

Anyhoo, this is very standard, and it's sort of okay, but it's fragile. The issue is in Profile_Pop() - you are making a pure leap of faith that back() is actually the element you should be popping.

(in particular, a very common source of bugs or fragility in this type of code is if the Profiler can be enabled & disabled; even if you always use a Push/Pop scoper class to avoid unmatched pairs, people can enable/disable anywhere and that creates a mismatch (assuming you don't continue to do the push/pops when disabled)).

A better way is Push/Retore :


int Profile_Push()
{
  int i = s_profile_vec.size();
  s_profile_vec.push_back();
  s_profile_vec.back().start = tick();
  return i;
}

void Profile_Restore(int i)
{
  s_profile_vec[i].end = tick();
  // [start,end] for this block is now set, as is stack depth, store it somewhere permanent
  s_profile_vec.resize(i);
}

Func1()
{
  int p = Profile_Push();

  ... other stuff ..
  Func2()

  Profile_Restore(p);
}

If you like you can assert in Restore that it's equivalent to a Pop. But crucially, if you are doing the redundant method and asserting, the fallback behavior should be Restore. (in practice for me, once I realized that Push-Restore is what I really want, that opened the door for me to allow Pushes without Pops, so I don't check that Restore is equal to a Pop, I let them be unequal).

What is the fundamental principle that makes this so much more robust that doing push/pop? It's elimination of parallel states that must be kept in sync. That's one of the greatest causes of bugs and should be done whenever possible; I don't do it enough, it's so tempting sometimes and you don't really see the necessity of it (because we all have Programmer Arrogance that makes us think we can keep it sync correctly).

eg. you constantly see code like :


struct Blah
{
  int m_x;

  int m_y;

  // NOTE : m_y must = m_x * 2;
};

At least this example code has a note about the redundant relationship that must be kept in sync, which is better than lots of real production code, but it's still just a bug waiting to happen. The redundant state should be eliminated and any queries for m_y should just go directly to the single variable that is authoritative.

That's obvious, but the stack is the exact same situation. The thing is, the program is maintaining its own stack, the execution stack, so any other stack that mirrors the execution stack is redundant state. You should just use the execution stack, which is what Push/Restore does for you.