7/10/2011

07-10-11 - Mystery - Do you ever need Total Order (seq_cst) -

The seq_cst memory model ops in C++0x are very mysterious to me. In particular, when exactly do you need them and why? Let's talk about some issues.

First of all, one quirk is that C++0x has no way of expressing a #StoreLoad barrier. Because of that, you often wind up having to use seq_cst mem ops just to get a StoreLoad. (for example in "eventcount" seq_cst is needed, but that's just to get the StoreLoad). But this is rather unfortunate because seq_cst is actually much heavier than StoreLoad. It also enforces "total order".

( there are some more quirks in the details of C++0x ; for example an acq_rel or seq_cst is really only the barrier you think it is if it's an "exchange" (read and write). That is, a store marked acq_rel is not actually acq_rel. Similarly, seq_cst fences don't act like you think they do (see Dmitry's posts on this); in particular you cannot just translate a #StoreLoad from another language into a fence(seq_cst) in C++0x ; also seq_cst in C++0x only schedules against other atomic ops with ordering constraints; they do not schedule against relaxed atomic ops or non-atomic memory loads, so in that sense it is not like a memory barrier at all ).

Let's compare against a theoretical "all_barriers" op, which has LoadLoad,StoreStore,StoreLoad, and LoadStore barriers on both sides of it. seq_cst is the same thing, but also enforces total order. Why would we ever want that?

What does total order do? Well the classic toy example is something like this :


int a,b,c,d = 0,0,0,0;

thread1 : 
  a = 1;
  b = 1;

thread2 :
  c = 1;
  d = 1;

thread3 :
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

thread4:
  A = a; B = b; C = c; D = d;
  print(A,B,C,D);

if all the ops are "all_barriers" then we know that the write to b must be after the write to a, and the write to d must be after the write to c, but the ops to {ab} can schedule against the ops to {cd} in any way - and in particular thread 3 and 4 can see them in different ways. So for example with all_barriers, this is a valid output :

thread3 :
 sees {abcd}
  1,0,0,0

thread4 :
 sees {cdab}
  0,0,1,1

if you use seq_cst the order of the stores on the bus is a single linear order. eg. maybe it's {acbd} or whatever, the point is its the same for all observers. In particular, if we pretend that thread3 and 4 can run instantly and atomically then they would print the same thing.

ADDENDUM : An actual working demonstration goes like this :

shared int a,b = 0,0;
result int c,d = 0,0;

thread1 :
  a = 1;

thread2 : 
  b = 1;

thread3 :
  local int A = a;
  local int B = b;
  if ( A == 1 && B == 0 )
    c = 1;

thread4 :
  local int B = b;
  local int A = a;
  if ( B == 1 && A == 0 )
    d = 1;

// invariant :

if memory order is seq_cst then (c+d) == 2 is impossible
any other memory order and (c+d) == 2 is possible

that is, threads 1 & 2 independently write to a & b. If you use a total order, then thread 3 and 4 must see either {ab} or {ba} - both are valid, but the point is it's the same. If you don't use total order then they could see the order differently.

We then check the order in a race-free way. c=1 can only be set if the order was {ab} , d=1 can only be set if the order was {ba} , therefore with seq_cst it's impossible to see both c=1 and d=1.

(end of addendum).

(note that if you only talk about two threads, then this "total order" issue never comes up, and acq_rel is the same as seq_cst; it only becomes different when three or more threads are watching each other)

But this is quite a weird thing to require; does it ever actually matter?

(just to confuse things, Intel recently changed the spec of the MFENCE instruction to enfore "causal consistency" instead of "sequential consistency" ; presumably this is part of the push to higher core counts and the goal is to get away from the very expensive system-wide sequence point that sequential consistency requires. "causal consistency" provides a total order only for operations which are "causally related" - eg. at the same memory location, or tied together by being used together in an atomic op).

(briefly : in C++0x a seq_cst fence acts to strictly order only other seq_cst ops; it also acts as acq_rel fence; it does not act to order ops that are not seq_cst ; for example it is not a #StoreStore barrier for two relaxes stores, one before and one after the fence (unless they are seq_cst themselves) ; part of the confusion comes from the fact that in practice on most platforms a seq_cst fence is implemented with an instruction like MFENCE which is in fact a stronger barrier - it orders all ops as being strictly before the mfence or strictly after).

(as usual the C++ language lawyer guys are spiraling off into a pit of abstraction and mildly fucking up; I'll do a followup post on how it should have been done).

Dekker's mutex is sometimes cited as an example that needs sequential consistency, because it is doing that sort of weird stuff where it uses multiple variables and needs the writes to them to go in the right order, but it actually doesn't need seq_cst. All it needs are #StoreLoad barriers. ( see here - though I think this code is actually broken)

Unfortunately StoreLoad is very awkward to express in C++0x ; one way to do it is to change the load into an AtomicIncrement by 0, so that it's a read-modify-write (RMW), then you can use a #StoreStore (which a normal "release" constraint). For example Dekker's lock can be expressed like this :


    void lock(int tid)
    {
        int him = tid^1;
        rl::linear_backoff b1;
        
        flag[tid]($).store(1,std::mo_relaxed);

        // #StoreLoad here

        while ( flag[him]($).fetch_add(0,std::mo_acq_rel) )
        {       
            b1.yield($);
            if ( turn($).load(std::mo_relaxed) != tid )
            {
                rl::linear_backoff b2;
                flag[tid]($).store(0,std::mo_relaxed);
                while ( turn($).load(std::mo_relaxed) != tid )
                {
                    b2.yield($);
                }   
                flag[tid]($).store(1,std::mo_relaxed);

                //#StoreLoad here
            }
        }
    }
    
    void unlock(int tid)
    {
        turn($).store( tid^1  , std::mo_release);
        flag[tid]($).store( 0 , std::mo_release);
    }

(backoffs are needed to make Relacy work). Now obviously the RMW is an unfortunate stupid expense, it requires a chunk of CPU time to hold that cache line, but it might be better than a seq_cst op, depending on how that's implemented.

So what I'm struggling with is imagining a situation that actually needs "total order" (and "needs" in a fundamental sense, not just because the coder was a dummy doing silly things). That is, you can easily cook up bad code that needs total order because it is making dumb assumptions about disparate memory ops becoming visible in the same order on different processors, but that's just bad code.

5 comments:

  said...

ahha! When your writing to memory mapped IO, you need total order. (a rare occurrence on some archs)

ryg said...

...only if you have multiple agents (threads/cores) talking to the same MMIO device simultaneously, which in itself seems like a weird thing to do; it only works if the given MMIO port is completely stateless, for once!

Simon Mogensen said...

In the toy example isn't the following a valid sequence of ops even with seq_cst?

thread4: A = a;
thread4: B = b;

thread1: a = 1;

thread3: A = a; B = b; C = c; D = d;

thread1: b = 1;

thread2: c = 1;
thread2: d = 1;

thread4: C = c;
thread4: D = d;

that will still give the output:

thread3 :
1,0,0,0

thread4 :
0,0,1,1

I might have misunderstood this completely but I can't see how seq_cst can guarantee that the output of thread 3 and 4 is be the same.

cbloom said...

Yeah of course you're right, it's a terrible toy example, I'll fix it in the post.

Total order only applies to the memory bus, the threads that interact with it can still have races of course.

cbloom said...

Okay, should be fixed. Also added a better simpler example that you can actually run.

old rants