
05-31-12 - On C++ Atomic Fences Part 2

Last post we talked about what C++0x fences are. Now we'll look at what they are not.

(NOTE : I am talking about what a C++0x fence is guaranteed to be by the standard; for now we will not concern ourselves with the fact that at the moment a C++0x fence always actually issues a CPU memory barrier, which is somewhat stronger than what C++0x promises; I have no doubt that CPUs and compilers will change in the future to be more aggressive about allowing relaxed memory ordering).

The C++0x fence is not visible to other threads unless they specifically schedule against the fence. That may sound obvious if you are thinking in terms of the C++0x definitions, but it's not true of a real CPU fence.

A very common example in old-school code is to use a CPU barrier for publication. Something like this :


Set up stuff
Barrier
Publish stuff

other threads :

if stuff is published
  then reading stuff should see set up

this does *not* work if "Barrier" is a C++0x fence. In particular we can construct this simple example :

atomic<int> x; // stuff
atomic<int> y; // publication flag

x & y initially 0

thread 0 :

    // set up stuff :
    x($).store(1,mo_relaxed);
    // barrier :
    std::atomic_thread_fence(mo_seq_cst,$);
    // publish :
    y($).store(1,mo_relaxed);

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
      bo.yield($);

    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

this does not work (xx can be zero). To make it work in C++0x you need something that can synchronize with the fence, for example this would work :

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
      bo.yield($);

    // added -
    std::atomic_thread_fence(mo_acquire,$);

    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

Why does it work on real CPUs then? (assuming the "relaxed" loads are at least C++ volatile so they go to memory and all that, but are not otherwise ordered). On all current real CPUs, a memory barrier sends a message to all other CPUs which creates a sync point for all of them (not exactly true, but effectively true). When thread 1 sees that y is non-zero, then the store to y on thread 0 must have happened, which means the barrier must have happened, so x must be set up, and our load of x occurs after the barrier, so it must see the set up value. That is, the barrier forms a sync point on *all* threads, not just the originator, and you don't necessarily need your own fence to tap into that sync; all you need is a way of connecting a "happens before/after" to it. In this case we can say :

thread 1 reads x after
thread 1 loads y with value == 1 after
thread 0 stores y with value == 1 after
thread 0 does a barrier after
thread 0 stores x

(aside : this is of course a terrible way to do publication in the modern world; you should just use a store-release and load-acquire pair to publish and consume; alternatively, if you publish a pointer, then you don't need any synchronization at all, because causality and load-consume take care of it for you).
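For concreteness, here's roughly what that release/acquire publication pattern looks like in plain C++11, outside the Relacy harness (just a sketch, no ($) markers, and the function names are mine) :

#include <atomic>

std::atomic<int> x(0); // stuff
std::atomic<int> y(0); // publication flag

void publisher()
{
    // set up stuff :
    x.store(1,std::memory_order_relaxed);
    // publish with a store-release :
    y.store(1,std::memory_order_release);
}

void consumer()
{
    // wait for publication with a load-acquire :
    while ( y.load(std::memory_order_acquire) == 0 )
    {
        // spin ; real code would yield or back off here
    }

    // the acquire load that saw y == 1 synchronizes-with the release store,
    // so the setup of x must be visible :
    int xx = x.load(std::memory_order_relaxed);
    // xx is guaranteed to be 1 here
}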

Okay. We can see the exact same thing in a more complex example. This is Dmitriy's Peterson mutex example. Here's my test harness :


struct test7 : rl::test_suite<test7, 2>
{
    // the mutex :
    std::atomic<int>    m_flag[2];
    std::atomic<int>    m_turn;
    // something protected by the mutex :
    std::atomic<int>    m_data;

    void before()
    {
        m_flag[0]($).store(0);
        m_flag[1]($).store(0);
        m_turn($).store(0);
        m_data($).store(0);
    }
    
    void after()
    {
        int d = m_data($).load();
        RL_ASSERT( d == 42 );
    }
    
    void lock(int tidx) 
    {
        int other = tidx^1;
    
        m_flag[tidx]($).store(1,std::mo_relaxed);
        m_turn($).store(other,std::mo_release);

        // ** Barrier here **       
        
        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }       
        
    }
    
    void unlock(unsigned tidx) 
    {
        m_flag[tidx]($).store(0,std::mo_release);
    }
    
    void thread(unsigned tidx) 
    {
        lock(tidx);

        // do m_data += 7; m_data *= 2;
        int d = m_data($).load(std::mo_relaxed);
        m_data($).store(d+7,std::mo_relaxed);
        d = m_data($).load(std::mo_relaxed);
        m_data($).store(d*2,std::mo_relaxed);
        
        unlock(tidx);
    }
};
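(In case you want to run this harness yourself : the Relacy driver is just a call to rl::simulate, something like the following - this is from memory, so treat the exact parameter names as approximate.)

#include <relacy/relacy_std.hpp>

int main()
{
    rl::test_params params;
    params.iteration_count = 1000000;   // number of interleavings to explore
    rl::simulate<test7>(params);        // runs test7::thread(0) and thread(1) under the simulator
    return 0;
}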

This is an example where if "Barrier" is an actual CPU barrier, this code works. But if "Barrier" acts like a C++0x seq_cst fence (and no more), then it doesn't work. (It doesn't work in the sense that it doesn't actually provide mutual exclusion, e.g. the code doesn't do what you expect it to.)

The failure is the same kind of thing as in the first trivial example; all current CPUs have ordering relations that are stronger than what C++0x requires. In particular, the bad execution case that's possible in C++0x (when the barrier is a seq_cst fence) goes like this :


thread 0 : starts lock
thread 0 : flag[0] = 1
thread 0 : turn = 1

thread 1 : starts lock
thread 1 : flag[1] = 1
thread 1 : turn = 0 ; (overwrites the previous turn=1)
thread 1 : barrier
thread 1 : load flag[0] ; sees 0 - old value (*!)

thread 0 : barrier
thread 0 : load flag[1] ; sees 1
thread 0 : load turn ; sees 0

(*!) is the problem point. For the code to work we must load a 1 there (which thread 0 set already). On a normal CPU you could say :

load flag[0] is after
barrier, which is after
store to turn = 0, which is after
store to turn = 1, which is after
store flag[0] = 1

therefore load flag[0] must see a 1. (I know the store to turn of 0 is after the store of 1 because the later load of turn on thread 0 sees a 0).

So part of the problem is that in C++0x a plain store's position in the linear "modification order" of an atomic object doesn't order anything else - the store overwrites the previous value without observing it, so you can't hang the "is after" chain above off of it. The easy fix, to ensure that relationship actually happens, is to change the store to turn into an RMW, which does read (and thus order against) the value it replaces :


    void lock(int tidx) 
    {
        int other = tidx^1;
    
        m_flag[tidx]($).store(1,std::mo_relaxed);

        m_turn($).exchange(other,std::mo_acq_rel); // changed to RMW
        
        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }       
        
    }

and that works (it's the same thing described in "Peterson's lock with C++0x atomics" at Just Software Solutions, among other places).
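For readers without the Relacy test rig, the fixed lock in plain C++11 syntax looks roughly like this (a sketch : I've dropped the ($) markers and replaced rl::backoff with a bare spin) :

#include <atomic>

struct peterson_mutex
{
    std::atomic<int> m_flag[2];
    std::atomic<int> m_turn;

    peterson_mutex()
    {
        m_flag[0] = 0;
        m_flag[1] = 0;
        m_turn = 0;
    }

    void lock(int tidx)
    {
        int other = tidx^1;

        m_flag[tidx].store(1,std::memory_order_relaxed);
        // the RMW that makes the C++0x ordering work :
        m_turn.exchange(other,std::memory_order_acq_rel);

        for(;;)
        {
            int f = m_flag[other].load(std::memory_order_acquire);
            int t = m_turn.load(std::memory_order_relaxed);
            if ( f && t == other )
                continue;   // spinning ; real code would back off / yield
            break;
        }
    }

    void unlock(int tidx)
    {
        m_flag[tidx].store(0,std::memory_order_release);
    }
};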

Some links on the topic :

Subtle difference between C++0x MM and other MMs
Subtle difference between C++0x MM and other MMs - Page 3
std::atomic_thread_fence - cppreference.com
Relacy finds fence bugs in spsc
Re questions about memory_order_seq_cst fence 2
Re questions about memory_order_seq_cst fence 1
Implementing Dekker's algorithm with Fences - Just Software Solutions
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 2
C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 1

6 comments:

Anonymous said...

Does this mean any number of people are going to write a lot of atomic-fenced C++ code that works on all current machines, but doesn't obey the spec and thus could break on future machines?

If so, hooray.

cbloom said...

Certainly. Though it's no worse than in the past when people would just write lockfree code against x86 which would rely on all kinds of subtle details of that platform to work.

Really anybody writing lockfree code should have a DEC Alpha or some similar super-weakly-ordered machine to do all their testing on.

But more generally, just because you write your lockfree code in C++0x doesn't mean it's portable. It's easy to rely on behavior which happens to be true but is not guaranteed.

Of course the same thing is true of C, and the C standards committees have always been totally unhelpful about making real code work, and making portability more robust.

For example in plain C, I constantly assume linear addressing in a single address space; according to the standard I can't assume that and a future machine or compiler could break it. I should be allowed to do something like :

#requires(linear_addressing)

and then the assumptions/requirements of the code are clear.

Similarly for lockfree code it would be nice to be able to say

#requires(causal_store_order)

then you get a compile error on platforms that don't give that.

Anonymous said...

Yeah, I agree all that stuff is busted.

I just feel like we should be trying to have less of the "not really portable" standards, not more of them--like C99 deciding to have an explicit rule for integer division rounding.

So it just seems weird for them to pick a brand new convention, especially one that's a little more broken. But yeah, even if they used traditional fences you could still have the DEC Alpha semantic sort of problem go uncaught, so maybe it's not so much C++0x's fault.

cbloom said...

The whole idea of undefined behavior, without any constraints to keep you inside the defined behavior, is totally broken.

The only way to actually make that work would be to provide a compiler + VM that strictly enforces the standard. There should have been a C reference VM that told you any time you used undefined behavior.

The plus side for C++0x is that there is lots of good work on simulators and static race analysis and such going on right now. So I think it will be possible to reasonably test if code is correct according to the standard (then you just have to pray the simulator is actually coded correctly, and the compiler on your new platform implements the standard correctly).

The idea that programmers in the real world will read the standards docs and know what they can or cannot rely on is just insane. But if they have a black box that they can put programs in and get a red light or green light, that's reasonable and real humans will actually do it.

Anonymous said...

Interesting that a CPU fence creates a sync point for all processors.

However, in the "old-school publication" example, I can't get past the fact that the compiler could reorder the flag load with the loads it's supposed to protect. There should at least be a compiler barrier in there, I think. Or publish a pointer.

In C++11 I believe a compiler barrier can be implemented as std::atomic_signal_fence(std::memory_order_acq_rel), but of course, that still doesn't make the example valid in C++11.

cbloom said...

"Interesting that a CPU fence creates a sync point for all processors."

The exact details of this are highly processor-dependent, and I encourage no one to rely on this or to even pursue this line of thought much further.

However, for the record, some more rambling on this topic.

On some platforms the MB actually is an entry in the seq_cst total order. Even though the MB itself is not an observable event on other processors, the other processors can still observe ordering relations to the MB, such as "I know this must be after the MB" and that gives them a transitive way to tap into the total order.

(there's a funny thing about these ops : sometimes just the existence of strictly sequenced ops can affect your core even if you cannot observe them directly, as long as you can say that other ops relate to them)

On other processors the MB is not an entry in the total order; however, it can still act as a sync point for other processors. The reason for this is more subtle and is where I might go wrong a bit. (Again I encourage everyone not to read this comment and just do things the safe modern way.) (Also, if there are any CPU experts out there, you could fill in some details for me.)

The crucial thing is that on all current real cores, the MB communicates via cache line messages. On the core that executes the MB, the store buffer is drained and all previous dirty lines are flushed before any future action can happen. The other cores then all receive these cache line update messages, and you can use these to create a happens-after relationship; e.g. "happens after the cache line messages from the MB".

It's pretty trivial to see that this works on in-order cores. The subtle bit is with out-of-order cores that allow speculative execution. The crucial thing that makes it work on real-world chips is that all major chips at the moment invalidate speculative reads when they receive a cache line invalidate message.
