7/10/2011

07-10-11 - Mystery - Do you ever need Total Order (seq_cst) -

The seq_cst memory model ops in C++0x are very mysterious to me. In particular, when exactly do you need them and why? Let's talk about some issues.

First of all, one quirk is that C++0x has no way of expressing a #StoreLoad barrier. Because of that, you often wind up having to use seq_cst mem ops just to get a StoreLoad. (for example in "eventcount" seq_cst is needed, but that's just to get the StoreLoad). But this is rather unfortunate because seq_cst is actually much heavier than StoreLoad. It also enforces "total order".

( there are some more quirks in the details of C++0x ; for example an acq_rel or seq_cst is really only the barrier you think it is if it's an "exchange" (read and write). That is, a store marked acq_rel is not actually acq_rel. Similarly, seq_cst fences don't act like you think they do (see Dmitry's posts on this); in particular you cannot just translate a #StoreLoad from another language into a fence(seq_cst) in C++0x (UPDATE : well, usually you can); also seq_cst in C++0x only schedules against other atomic ops with ordering constraints; they do not schedule against non-atomic memory loads, so in that sense it is not like a memory barrier at all ) (UPDATE : this is basically true, but see clarification in newer posts here : cbloom rants 05-31-12 - On C++ Atomic Fences Part 2 and cbloom rants 05-30-12 - On C++ Atomic Fences ).

Let's compare against a theoretical "all_barriers" op, which has LoadLoad,StoreStore,StoreLoad, and LoadStore barriers on both sides of it. seq_cst is the same thing, but also enforces total order. Why would we ever want that?

What does total order do? Well the classic toy example is something like this :


int a,b,c,d = 0,0,0,0;

thread1 : 
  a = 1;
  b = 1;

thread2 :
  c = 1;
  d = 1;

thread3 :
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

thread4:
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

if all the ops are "all_barriers" then we know that the write to b must be after the write to a, and the write to d must be after the write to c, but the ops to {ab} can schedule against the ops to {cd} in any way - and in particular thread 3 and 4 can see them in different ways. So for example with all_barriers, this is a valid output :

thread3 :
 sees {abcd}
  1,0,0,0

thread4 :
 sees {cdab}
  0,0,1,1

if you use seq_cst the order of the stores on the bus is a single linear order. eg. maybe it's {acbd} or whatever, the point is its the same for all observers. In particular, if we pretend that thread3 and 4 can run instantly and atomically then they would print the same thing.

ADDENDUM : An actual working demonstration goes like this :

shared int a,b = 0,0;
result int c,d = 0,0;

thread1 :
  a = 1;

thread2 : 
  b = 1;

thread3 :
  local int A = a;
  local int B = b;
  if ( A == 1 && B == 0 )
    c = 1;

thread4 :
  local int B = b;
  local int A = a;
  if ( B == 1 && A == 0 )
    d = 1;

// invariant :

if memory order is seq_cst then (c+d) == 2 is impossible
any other memory order and (c+d) == 2 is possible

that is, threads 1 & 2 independently write to a & b. If you use a total order, then thread 3 and 4 must see either {ab} or {ba} - both are valid, but the point is it's the same. If you don't use total order then they could see the order differently.

We then check the order in a race-free way. c=1 can only be set if the order was {ab} , d=1 can only be set if the order was {ba} , therefore with seq_cst it's impossible to see both c=1 and d=1.

(end of addendum).

(note that if you only talk about two threads, then this "total order" issue never comes up, and acq_rel is the same as seq_cst; it only becomes different when three or more threads are watching each other)

But this is quite a weird thing to require; does it ever actually matter?

(just to confuse things, Intel recently changed the spec of the MFENCE instruction to enfore "causal consistency" instead of "sequential consistency" ; presumably this is part of the push to higher core counts and the goal is to get away from the very expensive system-wide sequence point that sequential consistency requires. "causal consistency" provides a total order only for operations which are "causally related" - eg. at the same memory location, or tied together by being used together in an atomic op).

(briefly : in C++0x a seq_cst fence acts to strictly order only other seq_cst ops; it also acts as acq_rel fence; it does not act to order ops that are not seq_cst ; for example it is not a #StoreStore barrier for two relaxes stores, one before and one after the fence (unless they are seq_cst themselves) ; part of the confusion comes from the fact that in practice on most platforms a seq_cst fence is implemented with an instruction like MFENCE which is in fact a stronger barrier - it orders all ops as being strictly before the mfence or strictly after).

(as usual the C++ language lawyer guys are spiraling off into a pit of abstraction and mildly fucking up; I'll do a followup post on how it should have been done).

Dekker's mutex is sometimes cited as an example that needs sequential consistency, because it is doing that sort of weird stuff where it uses multiple variables and needs the writes to them to go in the right order, but it actually doesn't need seq_cst. All it needs are #StoreLoad barriers. ( see here - though I think this code is actually broken)

Unfortunately StoreLoad is very awkward to express in C++0x ; one way to do it is to change the load into an AtomicIncrement by 0, so that it's a read-modify-write (RMW), then you can use a #StoreStore (which a normal "release" constraint). For example Dekker's lock can be expressed like this :


    void lock(int tid)
    {
        int him = tid^1;
        rl::linear_backoff b1;
        
        flag[tid]($).store(1,std::mo_relaxed);

        // #StoreLoad here

        while ( flag[him]($).fetch_add(0,std::mo_acq_rel) )
        {       
            b1.yield($);
            if ( turn($).load(std::mo_relaxed) != tid )
            {
                rl::linear_backoff b2;
                flag[tid]($).store(0,std::mo_relaxed);
                while ( turn($).load(std::mo_relaxed) != tid )
                {
                    b2.yield($);
                }   
                flag[tid]($).store(1,std::mo_relaxed);

                //#StoreLoad here
            }
        }
    }
    
    void unlock(int tid)
    {
        turn($).store( tid^1  , std::mo_release);
        flag[tid]($).store( 0 , std::mo_release);
    }

(backoffs are needed to make Relacy work). Now obviously the RMW is an unfortunate stupid expense, it requires a chunk of CPU time to hold that cache line, but it might be better than a seq_cst op, depending on how that's implemented.

So what I'm struggling with is imagining a situation that actually needs "total order" (and "needs" in a fundamental sense, not just because the coder was a dummy doing silly things). That is, you can easily cook up bad code that needs total order because it is making dumb assumptions about disparate memory ops becoming visible in the same order on different processors, but that's just bad code.

15 comments:

  said...

ahha! When your writing to memory mapped IO, you need total order. (a rare occurrence on some archs)

ryg said...

...only if you have multiple agents (threads/cores) talking to the same MMIO device simultaneously, which in itself seems like a weird thing to do; it only works if the given MMIO port is completely stateless, for once!

Simon Mogensen said...

In the toy example isn't the following a valid sequence of ops even with seq_cst?

thread4: A = a;
thread4: B = b;

thread1: a = 1;

thread3: A = a; B = b; C = c; D = d;

thread1: b = 1;

thread2: c = 1;
thread2: d = 1;

thread4: C = c;
thread4: D = d;

that will still give the output:

thread3 :
1,0,0,0

thread4 :
0,0,1,1

I might have misunderstood this completely but I can't see how seq_cst can guarantee that the output of thread 3 and 4 is be the same.

cbloom said...

Yeah of course you're right, it's a terrible toy example, I'll fix it in the post.

Total order only applies to the memory bus, the threads that interact with it can still have races of course.

cbloom said...

Okay, should be fixed. Also added a better simpler example that you can actually run.

preshing said...

I have found your low-level threading notes generally helpful, but I think you've made a couple of errors in this post:

1. You say that a seq_cst fence does not order relaxed memory ops in C++0x. However section 29.8.2 of the standard uses the term "synchronizes with" to say that a release fence enforces ordering with respect to an acquire fence. This page on cppreference.com says the same thing in simpler language. A seq_cst fence is both kinds of fences, and therefore must enforce ordering too.

2. You say there's no way to get a #StoreLoad barrier in C++0x. However this proposal, which I believe was accepted, says the intended implementation of std::atomic_memory_fence(std::memory_order_seq_cst) on a SPARC RMO is #LoadLoad | #LoadStore | #StoreLoad | #StoreStore. And in the post by Anthony which you link, the first four paragraphs of his analysis explain what the seq_cst fences are doing; I don't know what that is if not a StoreLoad.

When you mention "seq_cst fences don't act like you think they do", which of Dmitriy's posts are you referring to? Could you share the link?

cbloom said...

Preshing, good questions. I'll take the second one first :

I don't mean to imply that you cannot get a #StoreLoad in C++0x. Of course you can, and I do it often in different ways.

What I'm saying is that you cannot *express* a #StoreLoad in C++0x. That is, there's no way to say "the only memory ordering required here is #StoreLoad".

The whole point of memory-order specification is to require the absolute minimum to make a program correct. This is ideal for efficiency, but more than that is the mathematical truth at the heart of the algorithm.

StoreLoad is the biggest omission in C++0x that I ran into on a regular basis in trying to write lockfree code. You often get to a situation where you know all you need to make a certain algorithm correct is to add a StoreLoad order constraint, but you cannot express that in C++0x. So you can either add a full seq_cst fence, or the slightly more minimal thing which I generally prefer is like this :

1. Store
2. want #StoreLoad here
3. Load

Change either #1 or #3 to a Load+Store op (like Exchange) , then use #LoadLoad or #StoreStore at #2.

cbloom said...

Here are two more posts on the StoreLoad issue :

http://cbloomrants.blogspot.com/2011/07/07-31-11-example-that-needs-seqcst_31.html

http://cbloomrants.blogspot.com/2011/11/11-28-11-some-lock-free-rambling.html

I think I'll just write a fresh post about fences because writing anything significant in this comment box sucks.

preshing said...

I get your point about wanting to express StoreLoad as a minimum guarantee requirement. Still, it seems misleading to say "you cannot just translate a #StoreLoad from another language into a fence(seq_cst)". You totally can. Also, all CPU instructions that perform StoreLoad also obtain the other three barrier effects, so it's not like C++11 missed out on the chance to use a better (single) instruction. FWIW, it's implemented as an XCHG in VS11 Beta x86.

cbloom said...

"Still, it seems misleading to say "you cannot just translate a #StoreLoad from another language into a fence(seq_cst)". You totally can."

err, yeah I think you're right about that. Either I was mistaken or the wording about fences in the standard got stronger in one of the revisions. The new wording looks different from what I recall reading last.

preshing said...

I've seen some ambiguous old info floating around newsgroups, too.

To be a bit more certain, I checked what Relacy had to say about it:

struct iriw_test : rl::test_suite<iriw_test, 2>
{
    std::atomic<int> X, Y;
    int r1, r2;

    void before()
    {
        X($) = 0;
        Y($) = 0;
    }

    void thread(unsigned index)
    {
        if (0 == index)
        {
            X.store(1, rl::memory_order_relaxed);
            std::atomic_thread_fence(rl::memory_order_seq_cst);
            r1 = Y.load(rl::memory_order_relaxed);
        }
        else
        {
            Y.store(1, rl::memory_order_relaxed);
            std::atomic_thread_fence(rl::memory_order_seq_cst);
            r2 = X.load(rl::memory_order_relaxed);
        }
    }

    void after()
    {
        RL_ASSERT(r1 == 1 || r2 == 1);
    }
};

This runs fine. If you remove the fences, it asserts and tells you one of loads took a value which was not current.

Also, if you change std::atomic to plain int (rl::var), it reports a DATA RACE on one of the stores. So according to Relacy, your shared variables must be at least relaxed atomics for this to work. This is actually consistent with the wording of the standard and Anthony's Dekker example. And it suggests that the current page on cppreference.com gets it wrong. (It says non-atomic accesses get ordered.) Though, I would be very interested to see an actual compiler & platform where a relaxed atomic int is treated differently from a plain int.

PS. I wrote atomic_memory_fence earlier, meant atomic_thread_fence, oops.

preshing said...

*** Finally, I think those are the minimum shared vars which must be relaxed atomic in that example. They complete the synchronize-with relations. Because those are there, no other non-atomics will be reordered across the fences...

cbloom said...

"So according to Relacy, your shared variables must be at least relaxed atomics for this to work."

Yes, the standard says nothing about how the std::atomic stuff interacts with non-std::atomic stuff.

"Though, I would be very interested to see an actual compiler & platform where a relaxed atomic int is treated differently from a plain int."

You know GCC or someone will do it someday and break lots of perfectly reasonable code in order to make some synthetic test go 1% faster.

"They complete the synchronize-with relations. Because those are there, no other non-atomics will be reordered across the fences..."

Not sure what you mean there; non-atomics can be essentially reordered at will.


Anyway, I want to take a moment to repeat the original purpose of this post, which is :

Do you ever actually need sequential consistency?

I call out the use case of a seq_cst fence as a "phantom" need for sequential consistency. What we are actually trying to get there is a #StoreLoad, we don't need the sequential consistency part of it at all (which is a much heavier synchronization in theory, even if in practice on current chips they are equal (*)), but there's no way in C++0x to get the one without the other.


(* = there is every reason to believe that in the massively multicore future, full system-wide bus-locking ops like sequential consistency will become highly undesirable)

preshing said...

I took an interest in your post because I have the same (or similar) question as you: Is there some algorithm where we really have no choice but to use sequentially consistent atomic variables? So far, they only seem to exist to offer a Java-like alternative. So you can plug in examples from Art of Multiprocessor Programming (which is pretty Java-focused) and they will just work. I guess that alone is a useful reason for them to exist, especially from the point of view of the C++ standards committee.

"non-atomics can be essentially reordered at will" -- There are limits to this. If it was true, how could a spinlock protect anything? You'd technically have to wrap all shared memory in atomics (even bitfields) and it would be a huge pain. Anthony talks about this in section 5.3.6 of his book, "Ordering nonatomic operations with atomics." The key (in C++ legalspeak) is that non-atomics are included in the "happens-before" relations of a single thread, and "synchronize-with" is just a way to bridge those relations across threads.

preshing said...

"They complete the synchronize-with relations. Because those are there, no other non-atomics will be reordered across the fences..."

Sorry, I was wrong in that case. In the example I gave, the seq_cst fence does order the neighboring atomics (section 29.3.6), but no synchronize-with relation is completed. So you are right, in that example, any non-atomics could be reordered at will.

old rants