5/30/2012

05-30-12 - On C++ Atomic Fences

C++0x's atomic_thread_fence is weird. Preshing asked some questions and pointed out some errors in this blog post which has got me to look into it again.

(caveat : I'm no standards-reading afficionado, and I find the C++0x rules very odd, so this is as much me learning out loud as anything).

I'm going to do this post a bit backwards; first some random notes.

1. I never use C++0x fences. In all the lockfree atomic code I've written I've never used them or wanted them. I put the memory ordering on the operations; it's easier to understand and usually more efficient (because doing something like making an exchange be acq_rel is a no-op on most platforms, exchange is already acq_rel, whereas adding another fence requires another hard sync to be issued because compilers are not yet sophisticated enough to merge atomic ordering operations).

The only case I have ever seen where a fence is better than putting the constraint on the operation is for the optimization of doing a relaxed op to detect that you need the ordering. Something like :


if ( atomic_ticket.load( mo_relaxed ) == me )
{
  std::atomic_thread_fence( mo_acquire );  // make previous load act like acquire
  ... do stuff on my ticket ...
}

is faster than :

if ( atomic_ticket.load( mo_acquire ) == me )
{
  ... do stuff on my ticket ...
}

this can be used for example with a ref counting destructor; you don't actually need the "acquire" until the refs go to zero. Which brings us to the next note :

2. The way to understand C++0x fences is to forget everything you know about CPU's, caches, what the CPU memory fence instructions do, any connotation in your head about what "barrier" or "fence" means. The people who are most confused about it are the ones who had some background writing lockfree assembly in the pre-C++0x days, because C++0x fences are really not what you are used to.

What C++0x fences really do is provide more sync points through which other C++0x atomic ops can create "happens before" relationships. You can heuristically think of them as modifying the neighboring ops, not as being an independent operation themselves. In particular :


An "acquire" fence can make a preceding load act like a load_acquire
A "release" fence can make a following store act like a store_release

(an acq_rel fence obviously does both)

A "seq_cst" fence provides an entry in the program total order ("S")
  then preceding loads & following stores can be located relative to that point in the order S
  (eg. either "happens before" or "happens after")

Actually the fences are rather like the old-school way of marking up memory ordering constraints. eg. instead of :

x.load( mo_acquire );  // #LoadLoad follows acquire load

you used to write :

x.load();
#LoadLoad

which is more like the fence method in C++0x :

x.load( mo_relaxed );
fence( mo_acquire ); // #LoadLoad

Errkay so let's get into more specifics.


Section 29.8 of C++ doc N3337 :

29.8.2 : A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.


A is a release fence
X is an atomic op on M, X modifies M, after A

Y is an atomic op on M, Y reads the value set at X, before B
B is an acquire fence


is this :

    m_x and m_y initially zero
    
    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            std::atomic_thread_fence(std::mo_release,$); // "A"
            m_y($).store(1,std::mo_relaxed); // "X" , m_y is "M"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }
            // we just read a 1 from m_y // "Y"
            
            std::atomic_thread_fence(std::mo_acquire,$); // "B"
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

Roughly what this says is the "release" ordering of a fence synchronizes with the "acquire" ordering of another fence if there is a shared variable ("M") that connects the two threads, after the release and before the acquire. Or, if you like, the release fence before the store to M makes the store act like a store-release, and the acquire fence after the load of M makes the load of M act like a load-acquire.

(just using a store-release and a load-acquire is the normal way of doing a publish/consume pattern like this)

29.8.3 : A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.


A is a release fence
X is an atomic op on M, X modifes M, after A

A synchronizes with B :

B is an atomic op on M, reads the value written by X


    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            std::atomic_thread_fence(std::mo_release,$);        // "A"
            m_y($).store(1,std::mo_relaxed);                    // "X" , m_y is "M"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }

            int y = m_y($).load(std::mo_acquire);  // "B" synchronizes with "A"
            RL_ASSERT( y == 1 );
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

This just says the same thing but that you can substitude the acquire fence for a load-acquire. That is, a release fence can synchronize with a load-acquire just as if it was a store-release.

29.8.4 : An atomic operation A that is a release operation on an atomic object M synchronizes with an acquire fence B if there exists some atomic operation X on M such that X is sequenced before B and reads the value written by A or a value written by any side effect in the release sequence headed by A.


A is an atomic release op on object M

X is an atomic op on M, before B, reads from A
B is an acquire fence


    void thread(unsigned tidx) 
    {
        if ( tidx == 0 )
        {
            m_x($).store(1,std::mo_relaxed);
            m_y($).store(1,std::mo_release); // "A"
        }
        else
        {
            while ( m_y($).load(std::mo_relaxed) == 0 )
            {
            }
            // we just read a 1 from m_y // "X"
            
            std::atomic_thread_fence(std::mo_acquire,$); // "B"
            
            int x = m_x($).load(std::mo_relaxed);
            RL_ASSERT( x == 1 );
        }
    }

Again the same thing, just saying an acquire fence can synchronize with a store-release.

That's about for interesting stuff in 29.8 , there's a bit more in 29.3 on fences :

29.3.5 : For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

A modifies M , before X
X is a seq_cst fence
B reads M , after X

Then B cannot see a value of M from before A. 
Well, duh. Note that of course this is only non-trivial when A and B are done on different threads. And B being "after X" usually means B is after something else in the total order S which you know to be after X.

This sort of says that a seq_cst fence acts as a #StoreLoad. Note however that both A and B must have "happens before/after" relationships with the seq_cst fence. If only one of them has that relation then it doesn't work. We'll revisit this in the next post when we talk about how seq_cst fence doesn't behave exactly as you think.

29.3.6 For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.

A modifies M , before X
X is a seq_cst fence
Y is a seq_cst fence, Y is after X
B reads M , after Y

Well, duh.

29.3.7 For atomic operations A and B on an atomic object M, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B occurs later than A in the modification order of M.

A is an RMW on M , before X
X is a seq_cst fence
Y is a seq_cst fence, Y is after X
B is an RMW on M , after Y

then B is after A on the RMW order of M.

There are two forms of very strong ordering guaranteed in C++0x. One is the total order (called "S") of seq_cst ops (aside : I believe that seq_cst fences count as independent entries in the list S, but I don't see anywhere that the standard actually says that). The other is the modification order of an atomic object. There is guaranteed to be a single consistent modification order to every atomic, though not all threads may see the same order, depending on how they view it. Furthermore, any Store without a read (not an RMW) clobbers the modification order, because it wipes out the history and makes it impossible to know what the order before the store was. But you can say something strong : as long as you only use RMW ops on a variable, then it has a single global order of changes, and every thread can only view those changes in order.

eg. say a bunch of threads are doing RMW Increment ops on a shared variable. Then the variable must go {1,2,3,...} in its timeline, and the threads that do it can only see an ordered subset of those values, eg. {1,3,4,8,...} but never {3,1,7,4,....}.

Anyhoo, 29.3.7 is just saying that the total order of seq_cst ops (S) and the total order of an RMW sequence can order against each other.

Some of the standard document points are so redundant, I'm surprised they don't have an entry for the converse :

X is a seq_cst fence, before A
A is an RMW on M
B is an RMW on M, after A
Y is a seq_cst fence, after B

then Y is after X in the total order S
which presumably is also true.

Okay, so next post we'll look at what the fence is *not*.

7 comments:

Unknown said...

Potentially silly question: what's "($)"?

Fabian 'ryg' Giesen said...

That's a bit of Relacy magic to make the automatic enumeration of possible event orderings work.

cbloom said...

Yeah, you can just pretend the $ is not there. It gets #define'd to different things for testing.

My own C++0x-look-alike atomic layer accepts the $ as an additional argument and does nothing with it, so that I can move code between Relacy and production without touching it.

(I also have an in-place Relacy-like test layer which uses real threads and real atomic ops, and in that case the $ does randomized thread switches and stalls to stress the code).

Brian said...

I get the feeling that the C++ spec writers added fences more as an afterthought.

A nitpick... An acquire fence makes some preceding load act like a load_acquire (the immediately preceding load might not access the same memory address as a store_release ).

One thing I am curious about that the spec doesn't talk much about--- when is the compiler allowed to introduce a deadlock? Let's supposed that we have two threads that each publicize something and then wait for the other thread to publicize. The C++ spec isn't very clear on whether the compiler can reorder the load_acquires in the wait for publication loop above the store release.

Brian said...

Guess that isn't quite right either since the synchronization happens at the fence and not preceding the load..

Anonymous said...

These posts are helpful. You should add them to your table of contents for low-level threading.

cbloom said...

Word up. Made new TOC :

http://cbloomrants.blogspot.com/2012/06/06-12-12-another-threading-post-index.html

old rants