7/18/2010

07-18-10 - Mystery - Why no isync for Acquire on Xenon -

The POWER architecture docs say that to implement the Acquire memory constraint, you should use "isync". The Xbox 360 docs say to use "lwsync" to enforce the Acquire memory constraint. Which is right? See:

Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Example POWER Implementation for C/C++ Memory Model
PowerPC storage model and AIX programming

Review of the PPC memory control instructions in case you're a lazy fucker who wants to butt in but not actually read the links that I post :

First of all, a review of the PPC memory model. Basically it's very lazy. We are dealing with in-order cores, so the load/store instructions are issued in order, but the caches and store buffers are not kept temporally in order. That means an earlier load can get a newer value than a later load, and stores can be delayed in the write queue. The result is that loads & stores can go out of order arbitrarily unless you specifically control them. (* one exception is that "consume" order is guaranteed, as it is on all chips but the Alpha; that is, *ptr is always newer than ptr). To control ordering you have:

lwsync = #LoadLoad barrier, #LoadStore barrier, #StoreStore barrier ( NOT #StoreLoad barrier ) ( NOT Sequential Consistency ).

lwsync gives you all the ordering that you have automatically all the time on x86 (x86 gives you every barrier but #StoreLoad for free). If you put an lwsync after every load and store you would have nice x86-like semantics.

In a hardware sense, lwsync basically affects only my own core; it makes me sequentialize my write queue and my cache reads, but doesn't cause me to make a sync point with all other cores.

sync = All barriers + Sequential Consistency ; this is equivalent to a lock xchg or mfence on x86.

Sync makes all the cores agree on a single sync point (it creates a "total order"), so it's very expensive, especially on very-many-core systems.
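To make the #StoreLoad case concrete, here is a minimal sketch of the Dekker-style "set my flag, then check yours" handshake, which is the classic pattern that lwsync alone cannot protect. It is written with C++11 std::atomic rather than raw PPC instructions, and the names (g_flag_a, try_enter_a, and so on) are made up for illustration. To my understanding of the mapping page linked above, a seq_cst store on POWER is implemented with a full sync in front of it, which is where the cost comes from.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical flags for two contenders, A and B.
static std::atomic<int> g_flag_a(0);
static std::atomic<int> g_flag_b(0);

// A announces intent (store), then checks B (load).
// The store must be visible before the load is performed: that is
// exactly a #StoreLoad ordering, so this needs seq_cst (full sync
// on PPC); release/acquire (lwsync-level ordering) is not enough.
bool try_enter_a() {
    g_flag_a.store(1, std::memory_order_seq_cst);
    if (g_flag_b.load(std::memory_order_seq_cst) != 0) {
        g_flag_a.store(0, std::memory_order_seq_cst);  // back off
        return false;
    }
    return true;
}

// Mirror image for B.
bool try_enter_b() {
    g_flag_b.store(1, std::memory_order_seq_cst);
    if (g_flag_a.load(std::memory_order_seq_cst) != 0) {
        g_flag_b.store(0, std::memory_order_seq_cst);  // back off
        return false;
    }
    return true;
}

// A releases its claim; only valid after a successful try_enter_a().
void leave_a() {
    assert(g_flag_a.load(std::memory_order_relaxed) == 1);
    g_flag_a.store(0, std::memory_order_seq_cst);
}
```

With weaker orderings, both sides could have their store still sitting in the write queue while their load reads the other flag as 0, and both would "enter".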

isync = #LoadLoad barrier, in practice it's used with a branch and causes a dependency on the load used in the branch. (note that atomic ops use loadlinked-storeconditional so they always have a branch there for you to isync on). In a hardware sense it causes all previous instructions to finish their loads before any future instructions start (it flushes pipelines).

isync seems to be the perfect thing to implement Acquire semantics, but the Xbox 360 doesn't seem to use it and I'm not sure why. In the article linked above they say :

"PowerPC also has the synchronization instructions isync and eieio (which is used to control reordering to caching-inhibited memory). These synchronization instructions should not be needed for normal synchronization purposes."

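Strictly, "Acquire" memory semantics needs to enforce #LoadLoad and #LoadStore: nothing after the acquiring load may move above it. The branch + isync sequence covers that, since no later instruction starts until the load has completed. lwsync certainly does give you acquire as well because it has both barriers, but it also does a #StoreStore that you don't need.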
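For reference, this is the usage pattern that Acquire exists to support, as a minimal sketch in C++11 atomics (not the Xbox 360 intrinsics; the names g_payload, g_ready, publish and try_consume are hypothetical). The comments note where, as far as I understand the linked docs, each platform would put its barrier.

```cpp
#include <atomic>
#include <cassert>

static int g_payload = 0;                  // plain data being handed over
static std::atomic<bool> g_ready(false);   // flag that publishes it

// Producer: write the data, then release-store the flag.
// On PPC the release side is "lwsync; store" in both recipes.
void publish(int value) {
    g_payload = value;
    g_ready.store(true, std::memory_order_release);
}

// Consumer: acquire-load the flag, then read the data.
// POWER docs recipe:   load; compare; branch; isync
// Xbox 360 recipe:     load; lwsync
// Either way the read of g_payload cannot move above the flag load.
bool try_consume(int* out) {
    assert(out != nullptr);
    if (!g_ready.load(std::memory_order_acquire)) {
        return false;
    }
    *out = g_payload;
    return true;
}
```

Without the acquire on the flag load, the read of g_payload could be satisfied from a stale cache line even though the flag already reads true.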

ADDENDUM : another Xenon mystery : is there a way to make "volatile" act like old fashioned volatile, not new MSVC volatile? eg. if I just want to force the compiler to actually do a memory load or store (and not optimize it out or get from register or whatever), but don't care about it being acquire or release memory ordered.
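For comparison, the closest thing to "old fashioned volatile" in portable C++11 is a relaxed atomic access. The sketch below assumes a C++11 compiler, which is not what the Xenon toolchain of this era provides, so treat it as an illustration of the semantics I'm asking for rather than an answer. The helper names are made up. Note also that the standard technically allows compilers to merge adjacent atomic accesses, though in practice they emit one load or store per call.

```cpp
#include <atomic>
#include <cassert>

// A real memory load, with no acquire ordering and no hardware barrier.
inline int load_relaxed(const std::atomic<int>& a) {
    return a.load(std::memory_order_relaxed);
}

// A real memory store, with no release ordering and no hardware barrier.
inline void store_relaxed(std::atomic<int>& a, int v) {
    a.store(v, std::memory_order_relaxed);
}

// Polls `flag` up to max_spins times; returns true if a nonzero value
// was seen. Each iteration goes through load_relaxed rather than
// reusing a value cached in a register.
bool poll_flag(const std::atomic<int>& flag, int max_spins) {
    assert(max_spins >= 0);
    for (int i = 0; i < max_spins; ++i) {
        if (load_relaxed(flag) != 0) {
            return true;
        }
    }
    return false;
}
```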

7 comments:

ryg said...

Don't see anything wrong with using isync for read-acquire. But it's hard to say. Maybe email the authors of the example POWER C++ memory model implementation? All recent POWER cores are in-order like the Xenon/PPU.

"is there a way to make "volatile" act like old fashioned volatile, not new MSVC volatile? eg. if I just want to force the compiler to actually do a memory load or store (and not optimize it out or get from register or whatever), but don't care about it being acquire or release memory ordered."
The Lockless programming article says that "on Xbox 360 the compiler does not insert any instructions to prevent the CPU from reordering reads and writes". So that's not real acquire/release semantics either.

What the compiler actually seems to do (consistently across all architectures) is to compile volatile loads as "load; CompilerReadWriteBarrier;" and volatile stores as "CompilerReadWriteBarrier; store;". On x86, that gives you load-acquire and store-release semantics. But on PPC, it's just a relaxed load/store with a compiler barrier.

I don't think there's a way to get guaranteed loads/stores without compiler barriers from VC++ 2005 onwards.

cbloom said...

"All recent POWER cores are in-order like the Xenon/PPU."

Oh, right. isync is probably necessary/cheaper on out-of-order cores. On in-order cores maybe lwsync and isync are the same cost so it's simpler just to use lwsync all the time.

"The Lockless programming article says that "on Xbox 360 the compiler does not insert any instructions to prevent the CPU from reordering reads and writes". So that's not real acquire/release semantics either."

Oh shit, I didn't notice that little caveat. So the special 2005+ MSVC volatile is *not* for Xbox 360.

Which as you point out means that maybe MSVC volatile actually *never* generates special instructions, and in fact makes their volatile not so weird and different. They say it has "acquire/release" semantics, but of course on x86 all loads/stores do. So basically they're just saying the compiler won't muck it up. (maybe on Itanium they actually generate code for acquire/release memory ordering).

cbloom said...

"So the special 2005+ MSVC volatile is *not* for Xbox 360."

In the sense that it's not actually acquire/release. But it seems to generate the same thing as the x86 compiler, which is just a load/store and a compiler barrier. I need to investigate this a bit more to make sure I have all my facts right.

nothings said...

Tangential or relevant I dunno, since I'm a lazy fuck, but I believe POWER and PowerPC are not the same architecture.

This came up, for instance, in looking for that rotate-and-insert instruction, which exists in POWER but not PowerPC.

ryg said...

It's ridiculous case-sensitive naming.

POWER = IBM POWER processors (server)
POWER architecture = Architecture of same
Power architecture = The whole family including POWER, PowerPC, Cell and related. This covers the ISA and common parts between all of them.

POWER used to be slightly different from PowerPC, but they unified the ISAs some time ago. I think Cell/Xenon are already in the unified branch. The newer server processors definitely are.

"This came up, for instance, in looking for that rotate-and-insert instruction, which exists in POWER but not PowerPC."
There are multiple, and the most useful variant definitely exists on PPC: rlwimi/rldimi ("Rotate left [double]word immediate then mask insert"). I think that's been in there since the beginning (okay, the d variant is only in 64-bit PPCs).

Not sure if they removed or renamed the POWER-exclusive instructions for the unified ISA. As long as they didn't reuse the encoding space, they can just handle the "unknown instruction" traps and have the OS emulate the instructions to maintain backwards compatibility.
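For anyone who hasn't used it, here is a rough C++ sketch of what rlwimi computes, assuming 32-bit operands and IBM bit numbering (bit 0 is the most significant bit). The function names are made up for illustration; on real hardware the shift and mask bounds are immediates encoded in the instruction, not runtime arguments.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit value left by sh bits (sh taken modulo 32).
inline uint32_t rotl32(uint32_t x, unsigned sh) {
    sh &= 31u;
    return (x << sh) | (x >> ((32u - sh) & 31u));
}

// Mask with ones in IBM-numbered bits mb..me inclusive.
// If mb > me the mask wraps around, as it does on PPC.
inline uint32_t ppc_mask32(unsigned mb, unsigned me) {
    assert(mb < 32u && me < 32u);
    uint32_t from_mb = 0xFFFFFFFFu >> mb;          // bits mb..31
    uint32_t up_to_me = 0xFFFFFFFFu << (31u - me); // bits 0..me
    return (mb <= me) ? (from_mb & up_to_me) : (from_mb | up_to_me);
}

// Emulates "rlwimi ra, rs, sh, mb, me": rotate rs left by sh, then
// insert the masked bits into ra, leaving ra's other bits untouched.
inline uint32_t rlwimi_emulated(uint32_t ra, uint32_t rs,
                                unsigned sh, unsigned mb, unsigned me) {
    uint32_t m = ppc_mask32(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}
```

For example, rlwimi_emulated(0xFFFFFFFF, 0xA, 4, 24, 27) rotates 0xA up to 0xA0 and inserts it under the mask 0xF0, giving 0xFFFFFFAF.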

nothings said...

it's the "immediate" in that that's the issue -- the non-immediate one is lacking.

We were looking at code where if we could use that, we could cut back from two variable-length shifts to one.

nothings said...

rlmi

"Note: the rlmi instruction is supported only in the POWER family architecture."

This was, as you mentioned in the other thread, for bitstream parsing.
