Comments on cbloom rants: 07-10-11 - Mystery - Do you ever need Total Order (seq_cst) -

"They complete the synchronize-with relations...

2012-05-31T20:00:10.411-07:00

"They complete the synchronize-with relations. Because those are there, no other non-atomics will be reordered across the fences..."

Sorry, I was wrong in that case. In the example I gave, the seq_cst fence does order the neighboring atomics (section 29.3.6), but no synchronize-with relation is completed. So you are right, in that example, any non-atomics could be reordered at will.

I took an interest in your post because I have the...

2012-05-31T10:10:02.224-07:00

I took an interest in your post because I have the same (or similar) question as you: Is there some algorithm where we really have no choice but to use sequentially consistent atomic variables? So far, they only seem to exist to offer a Java-like alternative. So you can plug in examples from Art of Multiprocessor Programming (which is pretty Java-focused) and they will just work. I guess that alone is a useful reason for them to exist, especially from the point of view of the C++ standards committee.

"non-atomics can be essentially reordered at will" -- There are limits to this. If it was true, how could a spinlock protect anything? You'd technically have to wrap all shared memory in atomics (even bitfields) and it would be a huge pain. Anthony talks about this in section 5.3.6 of his book, "Ordering nonatomic operations with atomics." The key (in C++ legalspeak) is that non-atomics are included in the "happens-before" relations of a single thread, and "synchronize-with" is just a way to bridge those relations across threads.

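To make the spinlock point concrete, here is a minimal sketch. The `SpinLock` type and `run_spinlock_demo` function are hypothetical names made up for illustration, not code from the book or from this thread. The counter is a plain int; the only atomics are in the lock. The release in `unlock()` synchronizes with the acquire in the next successful `lock()`, which should be what carries the plain increment across threads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical minimal spinlock, for illustration only.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock() {
        // Acquire: later accesses cannot move above a successful lock.
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() {
        // Release: earlier accesses cannot move below the unlock.
        flag.clear(std::memory_order_release);
    }
};

// Two threads increment a plain (non-atomic) counter under the lock.
int run_spinlock_demo(int iterations) {
    assert(iterations > 0);
    SpinLock lock;
    int counter = 0;  // deliberately not atomic
    auto work = [&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

Calling `run_spinlock_demo(n)` should return `2 * n`; without the lock the same code would be a data race on `counter`.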
"So according to Relacy, your shared variable...

2012-05-31T08:46:04.285-07:00

"So according to Relacy, your shared variables must be at least relaxed atomics for this to work."

Yes, the standard says nothing about how the std::atomic stuff interacts with non-std::atomic stuff.

"Though, I would be very interested to see an actual compiler & platform where a relaxed atomic int is treated differently from a plain int."

You know GCC or someone will do it someday and break lots of perfectly reasonable code in order to make some synthetic test go 1% faster.

"They complete the synchronize-with relations. Because those are there, no other non-atomics will be reordered across the fences..."

Not sure what you mean there; non-atomics can be essentially reordered at will.

Anyway, I want to take a moment to repeat the original purpose of this post, which is :

Do you ever actually need sequential consistency?

I call out the use case of a seq_cst fence as a "phantom" need for sequential consistency. What we are actually trying to get there is a #StoreLoad; we don't need the sequential consistency part of it at all (which is a much heavier synchronization in theory, even if in practice on current chips they are equal (*)), but there's no way in C++0x to get the one without the other.

(* = there is every reason to believe that in the massively multicore future, full system-wide bus-locking ops like sequential consistency will become highly undesirable)

*** Finally, I think those are the minimum shared ...

2012-05-31T04:51:18.863-07:00

*** Finally, I think those are the minimum shared vars which must be relaxed atomic in that example. They complete the synchronize-with relations. Because those are there, no other non-atomics will be reordered across the fences...

I've seen some ambiguous old info floating aro...

2012-05-30T17:36:26.283-07:00

I've seen some ambiguous old info floating around newsgroups, too.

To be a bit more certain, I checked what Relacy had to say about it:

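// Relacy test suite that runs two threads.
struct iriw_test : rl::test_suite<iriw_test, 2>
{
    std::atomic<int> X, Y;  // shared flags, accessed only with relaxed ops
    int r1, r2;             // one result per thread, checked in after()

    // Runs before the threads start: reset both flags.
    void before()
    {
        X($) = 0;
        Y($) = 0;
    }

    void thread(unsigned index)
    {
        if (0 == index)
        {
            // Thread 0: store X, full fence, then load Y.
            X.store(1, rl::memory_order_relaxed);
            std::atomic_thread_fence(rl::memory_order_seq_cst);
            r1 = Y.load(rl::memory_order_relaxed);
        }
        else
        {
            // Thread 1: store Y, full fence, then load X.
            Y.store(1, rl::memory_order_relaxed);
            std::atomic_thread_fence(rl::memory_order_seq_cst);
            r2 = X.load(rl::memory_order_relaxed);
        }
    }

    // Runs after both threads finish: at least one load must have seen a 1.
    void after()
    {
        RL_ASSERT(r1 == 1 || r2 == 1);
    }
};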

This runs fine. If you remove the fences, it asserts and tells you one of the loads took a value which was not current.

Also, if you change std::atomic to plain int (rl::var), it reports a DATA RACE on one of the stores. So according to Relacy, your shared variables must be at least relaxed atomics for this to work. This is actually consistent with the wording of the standard and Anthony's Dekker example. And it suggests that the current page on cppreference.com gets it wrong. (It says non-atomic accesses get ordered.) Though, I would be very interested to see an actual compiler & platform where a relaxed atomic int is treated differently from a plain int.
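For anyone without Relacy at hand, here is a rough sketch of the same two-thread test in plain C++11 with `std::thread`. The function name `count_both_zero` and the round count are made up for illustration. Unlike Relacy, this only samples whatever interleavings the machine happens to produce, so a clean run is evidence and not proof; the guarantee itself comes from the standard's rules for seq_cst fences.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store / fence / load test `rounds` times and counts how often
// both loads missed the other thread's store (r1 == 0 && r2 == 0).
int count_both_zero(int rounds) {
    assert(rounds > 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> X(0), Y(0);
        int r1 = -1, r2 = -1;  // each written by one thread, read after join

        std::thread t0([&] {
            X.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = Y.load(std::memory_order_relaxed);
        });
        std::thread t1([&] {
            Y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = X.load(std::memory_order_relaxed);
        });
        t0.join();
        t1.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With the fences in place the count should stay at zero. Removing the two fence lines makes the both-zero outcome legal, though how often it actually shows up depends on the hardware and on thread start-up timing.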

PS. I wrote atomic_memory_fence earlier, meant atomic_thread_fence, oops.

"Still, it seems misleading to say "you ...

2012-05-30T15:41:11.572-07:00

"Still, it seems misleading to say "you cannot just translate a #StoreLoad from another language into a fence(seq_cst)". You totally can."

err, yeah I think you're right about that. Either I was mistaken or the wording about fences in the standard got stronger in one of the revisions. The new wording looks different from what I recall reading last.

I get your point about wanting to express StoreLoa...

2012-05-30T13:59:03.412-07:00

I get your point about wanting to express StoreLoad as a minimum guarantee requirement. Still, it seems misleading to say "you cannot just translate a #StoreLoad from another language into a fence(seq_cst)". You totally can. Also, all CPU instructions that perform StoreLoad also obtain the other three barrier effects, so it's not like C++11 missed out on the chance to use a better (single) instruction. FWIW, it's implemented as an XCHG in VS11 Beta x86.

Here are two more posts on the StoreLoad issue : ...

2012-05-30T10:01:02.235-07:00

Here are two more posts on the StoreLoad issue :

http://cbloomrants.blogspot.com/2011/07/07-31-11-example-that-needs-seqcst_31.html

http://cbloomrants.blogspot.com/2011/11/11-28-11-some-lock-free-rambling.html

I think I'll just write a fresh post about fences because writing anything significant in this comment box sucks.

Preshing, good questions. I'll take the secon...

2012-05-29T09:31:40.633-07:00

Preshing, good questions. I'll take the second one first :

I don't mean to imply that you cannot get a #StoreLoad in C++0x. Of course you can, and I do it often in different ways.

What I'm saying is that you cannot *express* a #StoreLoad in C++0x. That is, there's no way to say "the only memory ordering required here is #StoreLoad".

The whole point of memory-order specification is to require the absolute minimum to make a program correct. This is ideal for efficiency, but more than that, it is the mathematical truth at the heart of the algorithm.

StoreLoad is the biggest omission in C++0x that I ran into on a regular basis in trying to write lockfree code. You often get to a situation where you know all you need to make a certain algorithm correct is to add a StoreLoad order constraint, but you cannot express that in C++0x. So you can either add a full seq_cst fence, or the slightly more minimal thing which I generally prefer is like this :

1. Store
2. want #StoreLoad here
3. Load

Change either #1 or #3 to a Load+Store op (like Exchange) , then use #LoadLoad or #StoreStore at #2.

I have found your low-level threading notes genera...

2012-05-28T22:08:12.111-07:00

I have found your low-level threading notes generally helpful, but I think you've made a couple of errors in this post:

1. You say that a seq_cst fence does not order relaxed memory ops in C++0x. However section 29.8.2 of the standard uses the term "synchronizes with" to say that a release fence enforces ordering with respect to an acquire fence. This page on cppreference.com says the same thing in simpler language. A seq_cst fence is both kinds of fences, and therefore must enforce ordering too.

2. You say there's no way to get a #StoreLoad barrier in C++0x. However this proposal, which I believe was accepted, says the intended implementation of std::atomic_memory_fence(std::memory_order_seq_cst) on a SPARC RMO is #LoadLoad | #LoadStore | #StoreLoad | #StoreStore. And in the post by Anthony which you link, the first four paragraphs of his analysis explain what the seq_cst fences are doing; I don't know what that is if not a StoreLoad.

When you mention "seq_cst fences don't act like you think they do", which of Dmitriy's posts are you referring to? Could you share the link?
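To illustrate point 1, here is a minimal sketch of fence-to-fence synchronization as I read section 29.8.2. The function and variable names are hypothetical. The atomic itself is only ever touched with relaxed operations; the ordering of the plain `payload` comes entirely from the release fence before the store and the acquire fence after the load that observes it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hands `value` from a writer thread to a reader thread through a plain int.
int run_fence_handoff(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready(false);  // accessed with relaxed ops only
    int seen = -1;

    std::thread writer([&] {
        payload = value;
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;  // ordered after the writer's assignment
    });
    writer.join();
    reader.join();

    assert(ready.load(std::memory_order_relaxed));
    return seen;
}
```

`run_fence_handoff(42)` should return 42. Since a seq_cst fence is at least as strong as both fences used here, swapping either one for `std::memory_order_seq_cst` should preserve the result.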

Okay, should be fixed. Also added a better simple...

2011-07-13T11:36:39.631-07:00

Okay, should be fixed. Also added a better simpler example that you can actually run.

Yeah of course you're right, it's a terrib...

2011-07-13T09:50:53.850-07:00

Yeah of course you're right, it's a terrible toy example, I'll fix it in the post.

Total order only applies to the memory bus, the threads that interact with it can still have races of course.
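As a sketch of what the total order does promise, here is the classic IRIW shape (independent reads of independent writes), which is an assumption on my part and not the toy example from the post. The function name `count_split_views` is made up. Two writers each set one flag, and two readers read the flags in opposite orders. With seq_cst everywhere, the readers may see different snapshots, but they can never disagree about which write came first.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Counts rounds in which the two readers saw the two writes in opposite orders.
int count_split_views(int rounds) {
    assert(rounds > 0);
    int split = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = -1, r2 = -1, r3 = -1, r4 = -1;

        std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
        std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
        std::thread rd1([&] {
            r1 = x.load(std::memory_order_seq_cst);
            r2 = y.load(std::memory_order_seq_cst);
        });
        std::thread rd2([&] {
            r3 = y.load(std::memory_order_seq_cst);
            r4 = x.load(std::memory_order_seq_cst);
        });
        w1.join();
        w2.join();
        rd1.join();
        rd2.join();

        // rd1 saw x before y, while rd2 saw y before x: forbidden by seq_cst.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) ++split;
    }
    return split;
}
```

Every other combination of results stays legal, which is the race between threads mentioned above; only the split view is ruled out.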

In the toy example isn't the following a valid...

2011-07-12T23:18:45.988-07:00

In the toy example isn't the following a valid sequence of ops even with seq_cst?

thread4: A = a;
thread4: B = b;

thread1: a = 1;

thread3: A = a; B = b; C = c; D = d;

thread1: b = 1;

thread2: c = 1;
thread2: d = 1;

thread4: C = c;
thread4: D = d;

that will still give the output:

thread3 :
1,0,0,0

thread4 :
0,0,1,1

I might have misunderstood this completely but I can't see how seq_cst can guarantee that the output of thread 3 and 4 will be the same.

...only if you have multiple agents (threads/cores...

2011-07-12T19:20:53.117-07:00

...only if you have multiple agents (threads/cores) talking to the same MMIO device simultaneously, which in itself seems like a weird thing to do; it only works if the given MMIO port is completely stateless, for one!

ahha! When your writing to memory mapped IO, you n...

2011-07-12T08:22:51.652-07:00

ahha! When you're writing to memory-mapped IO, you need total order. (a rare occurrence on some archs)