cbloom rants: 07-14-11 - compare_exchange_strong vs compare_exchange

7/14/2011

07-14-11 - compare_exchange_strong vs compare_exchange_weak

The C++0x standard was revised a while ago to split compare_exchange (aka CAS) into two ops. A quick note on the difference :


bool compare_exchange_weak( T * ptr, T * old, T new );

bool compare_exchange_strong( T * ptr, T * old, T new );

(BTW in the standard "old" is actually a reference, which is a damn shitty thing to do because it makes it very non-transparent that "old" gets mutated by these functions, so I am showing it as a pointer).

both try to do :


atomically {
  if ( *ptr == *old ) { *ptr = new; return true; }
  else { *old = *ptr; return false; }
}

the difference is that compare_exchange_weak can also return false for spurious failure. (the original C++0x definition of CAS always allowed spurious failure; the new thing is the _strong version which doesn't).

If it returns due to spurious failure, then *old might be left untouched (and in fact, *ptr might be equal to *old but we failed anyway).

If spurious failure can only occur due to contention, then you can still gaurantee progress. In fact in the real world, I believe that LL-SC architectures cannot gaurantee progress, because you can get spurious failure if there is contention anywhere on the cache line, and you need that contention to be specifically on your atomic variable to gaurantee progress. (I guess if you are really worried about this, then you should ensure that atomic variables are padded so they get their own cache line, which is generally good practice for performance reasons anyway).

On "cache line lock" type architectures like x86, there is no such thing as spurious failure. compare_exchange just maps to "cmpxchg" instruction and you always get the swap that you want. (it can still fail of course, if the value was not equal to the old value, but it will reload old). (BTW it's likely that x86 will move away from this in the future, because it's very expensive for very high core counts)

compare_exchange_weak exists for LL-SC (load linked/store conditional) type architectures (Power, ARM, basically everything except x86), because on them compare_exchange_strong must be implemented as a loop, while compare_exchange_weak can be non-looping. For example :

On ARM, compare_exchange_weak is something like :

compare_exchange_weak:

  ldrex     // load with reservation
  teq       // test equality
  strexeq   // store if equal

and strexeq can fail for two reasons - either because they weren't equal, or because the reservation was lost (because someone else touched our cache line).

To implement compare_exchange_strong you need a loop :

compare_exchange_strong:

  while ( ! compare_exchange_weak(ptr,old,new) ) { }

(note that you might be tempted to put a (*old = *ptr) inside the loop, but that's probably not a good idea, and not necessary, because compare_exchange_weak will eventually load *ptr into *old itself when it doesn't fail spuriously).

The funny bit is that when you use compare_exchange you often loop anyway. For example say I want to use compare_exchange_strong to increment a value, I have to do :


cur = *ptr;
while( ! compare_exchange_strong(ptr,&cur,cur+1) ) { }

(note it's a little subtle that this works - when compare_exchange_strong fails, it's because somebody else touched *ptr, so we then reload cur (this is why "old" is passed by address), so you then recompute cur+1 from the new value; so with the compare_exchange_strong, cur has a different value each time around this loop.)

But on an LL-SC architecture like ARM this becomes a loop on a loop, which is dumb when you could get the same result with a single loop :


cur = *ptr;
while( ! compare_exchange_weak(ptr,&cur,cur+1) ) { }

Note that with this loop now cur does *not* always take a new value each time around the loop (it does when it fails due to contention, but not when it fails just due to reservation-lost), but the end result is the same.

So that's why compare_exchange_weak exists, but you might ask why compare_exchange_strong exists. If we always use loops like this, then there's no need for it. But we don't always use loops like this, or we might want to loop at the much higher level. For example you might have something like :

bool SpinLock_TryLock(int * lock)
{
  int zero = 0;
  return compare_exchange_strong(lock,&zero,1);
}

which returns false if it couldn't get the lock (and then might do an OS wait) - you don't want to return false just because of a spurious failure. (that's not a great example, maybe I'll think of a better one later).

(BTW I think the C++0x stuff is a little bit broken, like most of C standardization, because they are trying to straddle this middle ground of exposing the efficient hardware-specific ways of doing things, but they don't actually expose enough to map directly to the hardware, and they also aren't high level enough to separate you from knowing about the hardware. For example none of their memory model actually maps directly to what x86 provides, therefore there are some very efficient x86-specific ways to do threading ops that cannot be expressed portable in C++0x. Similarly on LL-SC architectures, it would be preferrable to just have access to LL-SC directly.

I'd rather see things in the standard like "if LL-SC exist on this architecture, then they can be invoked via __ll() and __sc()" ; more generally I wish C had more conditionals built into the language, that would be so much better for real portability, as opposed to the current mess where they pretend that the language is portable but it actually isn't so you have to create your own mess of meta-language through #defines).

5 comments:

ryg said...: I'd rather see things in the standard like "if LL-SC exist on this architecture, then they can be invoked via __ll() and __sc()"
The problem with LL/SC is that they often come with limitations that are near-impossible to express in C/C++; e.g. on some implementations, you're not allowed to have any other memory access between a LL/SC pair (doing so will cause you to lose the reservation). In ASM that's no problem, just don't do loads/stores, but how do you express it in C/C++ where there's no binding way to force a value into a register, and even seemingly innocent things like adding a literal constant might require an extra memory access to load the constant from memory?

The truth is that on the micro level, most compilers aren't aware of such limitations, even if the underlying architecture or implementation has them. You just shove it all into intrinsics that are expanded very late into the compilation process, after any optimizations that might screw them up.; July 14, 2011 at 2:07 PM
cbloom said...: All of the platforms with LL/SC provide intrinsics for LL/SC , so you have that problem anyway. It's just at the moment I have to do rigamorale like

#if PPC
__lwarx()
#elif ARMV6
__ldrex()
#else
...

granted, you're right that it is ugly in C and can produce subtly broken code, but we have that problem anyway.

This is the classic cop-out of C standardization if it's messy or varies by platform they leave it undefined. But that doesn't make the problem go away. It just makes it *worse* because now it's left up to each compiler implementation on each platform, so now you have not only a messy problem, but one that's different everywhere and badly documented, and etc..; July 14, 2011 at 5:28 PM
cbloom said...: IMO there could easily be more things in the C standards that are like :

if the platform has a native vector type then the syntax for using it is "vector[4] int x;"

and then provide a mechanism like

#if platform supports(vector[4] int)
.. simd code ..
#else
.. non-simd code ..
#endif

I dunno, this is an unrelated rant.

Like I feel like C has two parts (actually make 3 or 4 parts), like there's the core bit of basic C stuff that every platform has to support. Then there's stuff that platforms may or may not have, for example I should be able to ask things like (#if platform unaligned-load-fast) - rather than just leaving that a big undefined mess.

Anyhoo...; July 14, 2011 at 5:31 PM
Anthony Williams said...: The non-member compare_exchange_weak and compare_exchange_strong functions take their arguments by pointer, for C compatibility. It is the member functions which take references. e.g.

std::atomic<int> ai;

int old=0;

if(ai.compare_exchange_strong(old,1))
do_something();

or alternatively

if(compare_exchange_strong(&ai,&old,1))
do_something();; July 26, 2011 at 1:50 PM
cbloom said...: Ah, thanks for the clarification.

I think non-const references are pretty much always horrible.

I literally thought there was a bug in my code the first time I ported some Win32 code from InterlockedCompareExchnage to C++0x compare_exchange because it kept changing my value of "old".

In fact back when I got to use C++ and pick my own coding style, I very much liked the convention that all const objects are passed by ref and all non-const are passed by pointer; this makes it super clear at the call site which args can be mutated and which can't. I miss that code base a lot. Oh well...; July 26, 2011 at 2:06 PM

cbloom rants

7/14/2011

07-14-11 - compare_exchange_strong vs compare_exchange_weak

5 comments:

old rants