(NOTE : I am talking about what a C++0x fence is guaranteed to be by the standard; for now we will not concern ourselves with the fact that current implementations of a C++0x fence always actually issue a CPU memory barrier, which is somewhat stronger than what C++0x promises; I have no doubt that CPUs and compilers will change in the future to be more aggressive about allowing relaxed memory ordering).
The C++0x fence is not visible to other threads unless they specifically synchronize with the fence. That maybe sounds obvious if you are thinking in terms of the C++0x definitions, but it's not true of a real CPU fence.
A very common example in old-school code is to use a CPU barrier for publication. Something like this :
Set up stuff
Barrier
Publish stuff

other threads :
    if stuff is published
        then reading stuff should see the set up

this does *not* work if "Barrier" is a C++0x fence. In particular we can construct this simple example :

atomic<int> x; // stuff
atomic<int> y; // publication flag

x & y initially 0

thread 0 :

    // set up stuff :
    x($).store(1,mo_relaxed);
    // barrier :
    std::atomic_thread_fence(mo_seq_cst,$);
    // publish :
    y($).store(1,mo_relaxed);

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
        bo.yield($);
    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

this does not work (xx can be zero). To make it work in C++0x you need something that can synchronize with the fence; for example, this would work :

thread 1 :

    // wait for publication :
    rl::backoff bo;
    while ( y($).load(mo_relaxed) == 0 )
        bo.yield($);
    // added :
    std::atomic_thread_fence(mo_acquire,$);
    // read it :
    int xx = x($).load(mo_relaxed);
    RL_ASSERT( xx == 1 );

Why does it work on real CPUs then? (assuming the "relaxed" loads are at least C++ volatile, so they actually go to memory, but are not otherwise ordered). On all current real CPUs a memory barrier sends a message to all other CPUs which creates a sync point for all of them (not exactly true, but effectively true). When thread 1 sees that y is non-zero, the store to y on thread 0 must have happened, which means the barrier must have happened, so x must be set up, and our load of x occurs after the barrier, so it must see the set-up value. That is, the barrier forms a sync point on *all* threads, not just the originator, and you don't necessarily need your own fence to tap into that sync; all you need is a way of connecting a "happens before/after" to it. In this case we can say :
thread 1 reads x
    after
thread 1 loads y with value == 1
    after
thread 0 stores y with value == 1
    after
thread 0 does a barrier
    after
thread 0 stores x
(aside : this is of course a terrible way to do publication in the modern world; you should just use a store-release / load-acquire pair to publish and consume, as sketched below. Alternatively, if you publish a pointer, then you don't need any explicit fence at all, because causality and load-consume take care of it for you).
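For reference, here's a minimal sketch of that store-release / load-acquire publication pattern in plain C++11 syntax (not Relacy's $ notation; the publisher/consumer names are just for illustration) :

#include <atomic>
#include <cassert>

int x = 0;              // stuff (plain data, needs no atomics)
std::atomic<int> y(0);  // publication flag

void publisher()
{
    x = 1;                                  // set up stuff
    y.store(1, std::memory_order_release);  // publish with a release-store
}

void consumer()
{
    // acquire-load : once it sees the release-store,
    // everything sequenced before that store is visible :
    while ( y.load(std::memory_order_acquire) == 0 )
        ;
    assert( x == 1 );  // guaranteed to see the set up
}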
Okay. We can see the exact same thing in a more complex example. This is Dmitriy's example Peterson mutex. Here's my test harness :
struct test7 : rl::test_suite<test7, 2>
{
    // the mutex :
    std::atomic<int> m_flag[2];
    std::atomic<int> m_turn;
    // something protected by the mutex :
    std::atomic<int> m_data;

    void before()
    {
        m_flag[0]($).store(0);
        m_flag[1]($).store(0);
        m_turn($).store(0);
        m_data($).store(0);
    }

    void after()
    {
        int d = m_data($).load();
        RL_ASSERT( d == 42 );
    }

    void lock(int tidx)
    {
        int other = tidx^1;

        m_flag[tidx]($).store(1,std::mo_relaxed);
        m_turn($).store(other,std::mo_release);

        // ** Barrier here **

        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }
    }

    void unlock(unsigned tidx)
    {
        m_flag[tidx]($).store(0,std::mo_release);
    }

    void thread(unsigned tidx)
    {
        lock(tidx);

        // do m_data += 7; m_data *= 2;

        int d = m_data($).load(std::mo_relaxed);
        m_data($).store(d+7,std::mo_relaxed);
        d = m_data($).load(std::mo_relaxed);
        m_data($).store(d*2,std::mo_relaxed);

        unlock(tidx);
    }
};

This is an example where if "Barrier" is an actual CPU barrier, this code works. But if "Barrier" acts like a C++0x seq_cst fence (and no more), then it doesn't work. (it doesn't work in the sense that it doesn't actually provide mutual exclusion, eg. the code doesn't do what you expect it to).
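(aside for anyone who hasn't used Relacy : a test_suite like this is driven by rl::simulate, roughly like so; take this driver as a sketch, the exact parameters vary by version :)

#include <relacy/relacy_std.hpp>

int main()
{
    // explores thread interleavings and reports any
    // RL_ASSERT failure along with a history trace :
    rl::simulate<test7>();
    return 0;
}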
The failure is the same kind of thing as in the first trivial example: all current CPUs provide ordering guarantees that are stronger than what C++0x requires. In particular, the bad execution case that's possible in C++0x (when "Barrier" is a seq_cst fence) goes like this :
thread 0 : starts lock
thread 0 : flag[0] = 1
thread 0 : turn = 1
thread 1 : starts lock
thread 1 : flag[1] = 1
thread 1 : turn = 0 ; (overwrites the previous turn=1)
thread 1 : barrier
thread 1 : load flag[0] ; sees 0 - old value (*!)
thread 0 : barrier
thread 0 : load flag[1] ; sees 1
thread 0 : load turn ; sees 0

(*!) is the problem point. For the code to work we must load a 1 there (which thread 0 set already). On a normal CPU you could say :
load flag[0] is after
barrier , which is after
store to turn = 0 , which is after
store to turn = 1 , which is after
store flag[0] = 1

therefore load flag[0] must see a 1. (I know the store of turn = 0 is after the store of turn = 1 because the later load of turn on thread 0 sees a 0).
So part of the problem is that C++0x doesn't let you hang these "is after" relationships off the linear "modification order" of an atomic object : a plain store takes its place in the modification order, but that by itself doesn't force other threads' loads to see it. An RMW, on the other hand, must read the latest value in the modification order, so the easy fix to ensure the "is after" relationship above actually happens is to change the store to turn into an RMW :
void lock(int tidx)
{
    int other = tidx^1;

    m_flag[tidx]($).store(1,std::mo_relaxed);
    m_turn($).exchange(other,std::mo_acq_rel); // changed to RMW

    rl::backoff bo;
    for(;;)
    {
        int f = m_flag[other]($).load(std::mo_acquire);
        int t = m_turn($).load(std::mo_relaxed);
        if ( f && t == other )
        {
            bo.yield($);
            continue;
        }
        break;
    }
}

and that works (it's the same thing described at Peterson's lock with C++0x atomics - Just Software Solutions, among other places).
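Stripped of the Relacy $ markers, the fixed lock looks roughly like this in plain C++11 (a sketch; a real implementation would back off or yield in the spin loop) :

#include <atomic>

struct peterson_mutex
{
    std::atomic<int> m_flag[2];
    std::atomic<int> m_turn;

    peterson_mutex() : m_turn(0)
    {
        m_flag[0] = 0;
        m_flag[1] = 0;
    }

    void lock(int tidx)
    {
        int other = tidx^1;

        m_flag[tidx].store(1, std::memory_order_relaxed);
        // RMW : must read the latest value in turn's modification order
        m_turn.exchange(other, std::memory_order_acq_rel);

        while ( m_flag[other].load(std::memory_order_acquire)
                && m_turn.load(std::memory_order_relaxed) == other )
        {
            // spin
        }
    }

    void unlock(int tidx)
    {
        m_flag[tidx].store(0, std::memory_order_release);
    }
};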
Some links on the topic :
Subtle difference between C++0x MM and other MMs
Subtle difference between C++0x MM and other MMs - Page 3
std::atomic_thread_fence - cppreference.com
Relacy finds fence bugs in spsc
Re: questions about memory_order_seq_cst fence 2
Re: questions about memory_order_seq_cst fence 1
Implementing Dekker's algorithm with Fences - Just Software Solutions
C++0x sequentially consistent atomic operations - comp.programming.threads | Google Groups 2
C++0x sequentially consistent atomic operations - comp.programming.threads | Google Groups 1