7/31/2010

07-31-10 - GCC Scheduling Barrier

When implementing lock-free threading, you sometimes need a compiler scheduling barrier, which is weaker than a CPU instruction scheduling barrier, or a cache temporal ordering memory barrier.

There's a common belief that an empty volatile asm in GCC is a scheduling barrier :


 __asm__ volatile("")

but that appears to not actually be the case (* or rather, it is actually the case, but not according to spec). What I believe it does do is it splits the "basic blocks" for compilation, but then after initial optimization there's another merge pass where basic blocks are combined and they can in fact then schedule against each other.

The GCC devs seem to be specifically defending their right to schedule across asm volatile by refusing to give gaurantees : gcc.gnu.org/bugzilla/show_bug.cgi?id=17884 or gcc-patches/2004-10/msg01048.html . In fact it did appear (in an unclear way) in the docs that they wouldn't schedule across asm volatile, but they claim that's a documentation error.

Now they also have the built-in "__sync" stuff . But I don't see any simple compiler scheduling barrier there, and in fact __sync has problems.

__sync is defined to match the Itanium memory model (!?) but then was ported to other platforms. They also do not document the semantics well. They say :

"In most cases, these builtins are considered a full barrier"

What do you mean by "full barrier" ? I think they mean LL,LS,SL,SS , but do they also mean Seq_Cst ? ( there also seem to be some bugs in some of the __sync implementations bugzilla/show_bug.cgi?id=36793 )

For example on PS3 __sync_val_compare_and_swap appears to be :


  sync
  lwarx
   .. loop ..
  stwcx
  isync

which means it is a full Seq_Cst operation like a lock xchg on x86. That would be cool if I actually knew it always was that on all platforms, but failing to clearly document the gauranteed memory semantics of the __sync operations makes them dangerous. (BTW as an aside, it looks like the GCC __sync intrinsics are generating isync for Acquire unlike Xenon which uses lwsync in that spot).

(note of course PS3 also has the atomic.h platform-specific implementation; the atomic.h has no barriers at all, which pursuant to previous blog Mystery - Does the Cell PPU need Memory Control might actually be the correct thing).

I also stumbled on this thread where Linus rants about GCC .

I think the example someone gives in there about signed int wrapping is pretty terrible - doing a loop like that is fucking bananas and you deserve to suffer for it. However, signed int wrapping does come up in other nasty unexpected places. You actually might be wrapping signed ints in your code right now and not know about it. Say I have some code like :

char x;
x -= y;
x += z;
What if x = -100 , y = 90 , and z = 130 ? You should have -100 - 90 + 130 = -60. But your intermediate was -190 which is too small for char. If your compiler just uses an 8 bit machine register it will wrap and it will all do the right thing, but under gcc that is not gauranteed.

See details on signed-integer-overflow in GCC and notes about wrapv and wrapv vs strict-overflow

15 comments:

cbloom said...

BTW the most compelling argument for signed int wrapping that I know is simply to write this :

a + b + c

what does that do? Is

(a + b) + c

the same as

a + (b + c)

? If signed ints don't wrap, the answer is no.

Jon Olick said...
This comment has been removed by the author.
Jon Olick said...
This comment has been removed by the author.
Jon Olick said...

I've never noticed this phenomenon on PowerPC architectures. Of course I've never looked for it either. I have done tons of inline assembly though.

I take it that this has occurred for you and is causing problems?

cbloom said...

"I take it that this has occurred for you and is causing problems? "

No, so far as I can tell *in practice* on our current consoles, it actually is a scheduling barrier.

However, it disturbs me to make that part of my library when the GCC devs specifically say it is not a barrier.

I can just imagine the nightmare threading bugs that would occur if an instruction was moved there.

Sigh.

Brian said...

I thought the GCC barrier was:
asm volatile("":::"memory");

sylvain-v said...

asm volatile("":::"memory") will indeed force the barrier, but it also seems to discard previously loaded memory and to reload them, degrading performances. There might be a non-documented asm volatile("") - see http://gcc.gnu.org/ml/gcc/2003-04/msg01180.html

cbloom said...

sylvain, in that very thread they say :

[quote]
> I observed that simply asm volatile (""), i.e., "old style asm" *does*
> do what I want.

That seems wrong to me. In any case, you definitely shouldn't
rely on this.
[/quote]

which is what I've been seeing. Currently asm volatile *does* seem to be what we want, but they are very defensive about not gauranteeing it. Sigh.

Brian said...

So what semantics do you want for such a statement? A full memory barrier is fine, it has well defined semantics. Just not reordering statements but allowing the compiler to still use values it has stored has pretty messy semantics.

cbloom said...

" So what semantics do you want for such a statement? A full memory barrier is fine, it has well defined semantics. Just not reordering statements but allowing the compiler to still use values it has stored has pretty messy semantics. "

The "memory" mark on an asm volatile is not a "memory barrier" ; it tells the compiler that all memory may have been trashed so everything has to be reloaded. That is needlessly inefficient. Basically it makes all your variables into volatiles. Note that it's also *insufficient* , if you actually want a real memory barrier, it does not generate one. So it's hard for me to imagine when you would want that behavior.

What I want is simply a "no compiler reorder". Of course that's pretty useless when you're just writing standard C since it has undefined memory semantics, but when I'm doing things that I know have very specific actual memory semantics, all I need to know is that the compiler won't reorder the things that I have very carefully ordered.

For example, LoadAcquire on x86 can be implemented as

x = *ptr
CompilerReadBarrier

Brian said...

So what you really want is to denote a set of data that you need to explicitly control the access order to (and for which your ordering rules should apply to) and then provide barriers that limit reordering of accesses to just that state and no other state.
On 32-bit x86, is the "memory" designation really that expensive. At worse you will spill a handful of registers...

cbloom said...

" So what you really want is to denote a set of data that you need to explicitly control the access order to (and for which your ordering rules should apply to) and then provide barriers that limit reordering of accesses to just that state and no other state. "

Well, yes, in theory, though in practice that's too much intellectual burden, so we just say "order everything".

" On 32-bit x86, is the "memory" designation really that expensive. At worse you will spill a handful of registers. "

Yeah I suspect that in fact the memory volatile is not a huge hit, but more investigation is needed.

Brian said...

If you don't want to make the compiler spill all of the state out of registers at a scheduling barrier (i.e. "memory" designation), you have to tell it what state matters.

Otherwise, it might inline some other methods or something and the CSE pass might then eliminate your read by reusing a stored value.

cbloom said...

No.

Brian said...

I'm probably misunderstanding what semantics you want then. You don't want something that prevents moving reads forward. What you really want is a barrier that prevents moving stores down or up, pushing reads down, or moving volatile reads up?

The real problem with GCC is it wasn't designed to support this and we're really just subverting the mechanism for inline assembly to kind of get what we want.

old rants