7/18/2010

07-18-10 - Mystery - Does the Cell PPU need Memory Control -

Is memory ordering needed on the PPU at all ?

I'm having trouble finding any information about this, but I did notice a funny thing in Mike Acton's Multithreading Optimization Basics :


// Pentium
#define  AtomicStoreFence() __asm { sfence }
#define  AtomicLoadFence()  __asm { lfence }

// PowerPC
#define  AtomicStoreFence() __asm { lwsync }
#define  AtomicLoadFence()  __asm { lwsync }

// But on the PPU
#define  AtomicStoreFence() 
#define  AtomicLoadFence()

Now, first of all, I should note that his Pentium defines are wrong. That doesn't inspire a lot of confidence, but Mike is more of a Cell expert than an x86 expert. (I've noted before that thinking sfence/lfence are needed for ordinary memory ordering on x86 is a common mistake; this is just another example of the fact that "You should not try this at home!" ; even top experts get the details wrong; it's pretty sick how many random kids on gamedev.net are rolling their own lock-free queues and such these days; just say no to lock-free drugs mmmkay). (Recall that sfence and lfence only affect weakly-ordered accesses such as write-combined memory and SSE non-temporal stores/loads; ordinary x86 loads and stores don't need them at all.)
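For reference, here is roughly what correct x86 defines would look like - a hedged sketch, not Mike's actual code : plain x86 loads and stores are already strongly ordered by the hardware, so a compiler-only barrier is enough (see the ADDENDUM below), and the real sfence/lfence get reserved for the non-temporal cases. NonTemporalStoreFence is just an illustrative name.

// x86 (MSVC-style), a sketch of corrected defines.
// Plain loads/stores are already ordered by the hardware,
// so only the compiler needs restraining.
#include <intrin.h>
#define  AtomicStoreFence() _WriteBarrier()   // compiler barrier, emits no code
#define  AtomicLoadFence()  _ReadBarrier()    // compiler barrier, emits no code

// sfence/lfence only matter around weakly-ordered accesses,
// e.g. after a run of non-temporal (movnt*) stores :
#define  NonTemporalStoreFence() __asm { sfence }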

Anyway, the issue on the PPU is that you have two hardware threads but only one core, and more importantly only one cache (and, I think, only one store queue). The instructions execute in order; all of the memory out-of-orderness on these PowerPC chips comes from the cache and store queues, so since both hardware threads share the same cache, maybe there is no out-of-orderness between them ? Does that mean that memory accesses act sequentially consistent on the PPU ?
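Here's roughly the kind of litmus test that would answer that question - a hedged sketch using pthreads; actually pinning the two functions onto the two PPU hardware threads is platform specific and omitted. On a sequentially consistent machine the r0 == 0 && r1 == 0 outcome can never happen; on x86 (store buffers) or a reordering PPC it can.

// classic store->load reordering litmus test, sketch only
#include <pthread.h>
#include <stdio.h>

static volatile int X, Y, r0, r1, go;

static void * thread0(void * arg) { while(!go){} X = 1; r0 = Y; return 0; }
static void * thread1(void * arg) { while(!go){} Y = 1; r1 = X; return 0; }

int main(void)
{
    int reorderings = 0;
    for(int i=0;i<100000;i++)
    {
        pthread_t t0, t1;
        X = Y = r0 = r1 = 0; go = 0;
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_create(&t1, NULL, thread1, NULL);
        go = 1;                       // release both threads
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if ( r0 == 0 && r1 == 0 ) reorderings++;
    }
    printf("store->load reorderings observed : %d\n", reorderings);
    return 0;
}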

Hmmm, I'm not confident about this and need more information. The nice thing about Cell being open is that there is tons of information about it from IBM, but it's just a mess and very hard to find what you want.

Of note - thread switches on the Cell PPU are pretty catastrophically slow, so doing a lot of micro-threading doesn't really make much sense on that platform anyway.

ADDENDUM : BTW I should note that even if the architecture doesn't require memory ordering (such as on x86), doing this :


#define  AtomicStoreFence() 
#define  AtomicLoadFence()

is a bad idea, because the compiler can still reorder things on you. Better to do :

#define  AtomicStoreFence()  _CompilerWriteBarrier() 
#define  AtomicLoadFence()   _CompilerReadBarrier()
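(However _CompilerWriteBarrier / _CompilerReadBarrier are actually defined - as a hedged sketch, on two common toolchains they could map to something like this :)

#if defined(_MSC_VER)
// MSVC : compiler-only barriers from <intrin.h>, no instructions emitted
#define  _CompilerWriteBarrier()  _WriteBarrier()
#define  _CompilerReadBarrier()   _ReadBarrier()
#else
// GCC/Clang : an empty asm with a "memory" clobber keeps the compiler
// from moving memory accesses across this point
#define  _CompilerWriteBarrier()  __asm__ __volatile__("" ::: "memory")
#define  _CompilerReadBarrier()   __asm__ __volatile__("" ::: "memory")
#endif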

5 comments:

malte said...

Afaik the PPU still has a load-miss queue and the same store-gather buffer mechanism as Xenon, which could execute loads/stores out of program order even within a single core.

ryg said...

It's a bit more subtle than that.

Yes, the loads and stores can "execute" out of order, but in this particular case that doesn't mean much. It doesn't matter which order you execute loads in if the memory doesn't change in the meantime. Similarly, the completion order of stores doesn't matter if nobody does any loads until the last store is retired.

This simplifies matters considerably if there are no external agents modifying memory.

Processors do have mechanisms in place to ensure causality within a single thread (kinda obvious, after all we don't need to place fences in single-threaded programs, ever). If you walk through the cases, the only problematic operation in this scenario is when a load accesses memory at an address that is being accessed by an in-flight store that's not yet completed - and we know how the CPU reacts in that case: that's a load-hit-store, which causes the PPU to stall until the store buffer has been written to L1.

Anyway, the actual issue: multiple hardware threads. It heavily depends on how the particular implementation looks - in particular, how the store buffers were adapted to multiple HW threads. The only reasonable implementation I can think of on PPC is having one shared set of store buffers used by all HW threads.

In this implementation, as far as loads/stores are concerned, the HW multithreading boils down to picking instructions alternately from thread A and B and interleaving them into one instruction stream. Loads and stores are sequentially consistent within that core, as any dangerous sequences are automatically serialized by triggering LHS stalls.

So if they're doing that, then you're indeed home free without explicit memory barriers as long as only the PPU touches memory. When you're communicating with SPUs (e.g. job queues in main memory), you definitely need memory barriers.
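(A hedged illustration of that last point, the PPU-to-SPU case - names here are illustrative, not any particular job system : once another agent is pulling the data out of main memory, the publish needs a real barrier between the payload and the ready flag.)

// PPU side : publish a job for an SPU to pick up (sketch)
struct Job { int opcode; int data; };

volatile struct Job * g_job_slot;   // written by the PPU, read by an SPU
volatile int          g_job_ready;

void ppu_publish_job(volatile struct Job * job)
{
    g_job_slot = job;                              // payload first
    __asm__ __volatile__("lwsync" ::: "memory");   // order payload before flag
    g_job_ready = 1;                               // then the flag
}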

cbloom said...

" The only reasonable implementation I can think of on PPC is having one shared set of store buffers used by all HW threads.

In this implementation, as far as loads/stores are concerned, the HW multithreading boils down to picking instructions alternately from thread A and B and interleaving them into one instruction stream. Loads and stores are sequentially consistent within that core, as any dangerous sequences are automatically serialized by triggering LHS stalls."

Yep, this is exactly my suspicion. It would be nice to get some real confirmation that this is the case, but it's hard to imagine anything else.

Note that this doesn't mean you don't have to do anything for shared variables; they still have to use "atomic" ops to be consistent, and you have to make sure that accesses to them actually go to memory; they just don't need an lwsync or whatever to enforce cache line timing order.
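(To make that distinction concrete - a hedged sketch, assuming the shared-store-queue picture above and that only PPU hardware threads touch these variables :)

// volatile so the compiler actually emits the loads/stores to memory
static volatile int s_payload;
static volatile int s_ready;

void publish(int value)
{
    s_payload = value;   // ordinary store
    s_ready   = 1;       // no lwsync in between, per the discussion above
}

// a read-modify-write on a shared counter is different : it still needs a
// real atomic op (lwarx/stwcx. on PPC), because the two hardware threads
// can interleave between the load and the store of a plain counter++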

Anonymous said...

I've written a small sample program which provides at least partial evidence of your theory. You can check it out here.

cbloom said...

Cool, nice work. I need to go ahead and try my test suite without ordering on PS3. I finally wrote my own Relacy-like simulator that I can run on any of our platforms, so I should be able to test this stuff somewhat exhaustively in the future.
