11/12/2014

11-12-14 - Intel TSX notes

Had a little look at Intel TSX. Some sloppy notes.

TSX are new instructions for (limited) hardware transactions.

I'm gonna talk mainly about the newer RTM (XBEGIN/XEND) because it's conceptually clearer than XACQUIRE/XRELEASE.

In general RTM allows a chunk of speculative code to be run, which can be fully rolled back, and which is not visible to any other core until it is committed. This allows you to do a transaction involving several memory ops, and only commit if there was no contention, so the entire transaction appears atomically to other cores.

The TSX documentation talks about it a lot in terms of "hardware lock elision" (HLE) but it's more general than that. (and HLE the algorithmic process is not to be confused with the annoyingly named "HLE instructions" (XACQUIRE/XRELEASE))

TSX does not gaurantee forward progress, so there must always be a fallback non-TSX pathway. (complex transactions might always abort even without any contention because they overflow the speculation buffer. Even transactions that could run in theory might livelock forever if you don't have the right pauses to allow forward progress, so the fallback path is needed then too).

TSX works by keeping a speculative set of registers and processor state. It tracks all reads done in the speculation block, and enqueues all writes to be delayed until the transaction ends. The memory tracking of the transaction is currently done using the L1 cache and the standard cache line protocols. This means contention is only detected at cache line granularity, so you have the standard "false sharing" issue.

If your transaction reads a cache line, then any write to that cache line by another core causes the transaction to abort. (reads by other cores do not cause an abort).

If your transaction writes a cache line, then any read or write by another core causes the transaction to abort.

If your transaction aborts, then any cache lines written are evicted from L1. If any of the cache lines involved in the transaction are evicted during the transaction (eg. if you touch too much memory, or another core locks that line), the transaction is aborted.

TSX seems to allow quite a large working set (up to size of L1 ?). Obviously the more memory you touch the more likely to abort due to contention.

Obviously you will get aborts from anything "funny" that's not just plain code and memory access. Context switches, IO, kernel calls, etc. will abort transactions.

At the moment, TSX is quite slow, even if there's no contention and you don't do anything in the block. There's a lot of overhead. Using TSX naively may slow down even threaded code. Getting significant performance gains from it is non-trivial.

RTM Memory Ordering :

A successful RTM commit causes all memory operations in the RTM region to appear to execute atomically. A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction. The XBEGIN instruction does not have fencing semantics. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and never made visible to any other logical processor.

One of the best resources is the new Intel Optimization Manual, which has a whole chapter on TSX.

RTM is very nice as a replacement for traditional lockfree algorithms based on atomics when those algorithms are very complex. Something simple like just an atomic increment, you obviously shouldn't use RTM, just do the atomic. Even something like a lockfree LIFO Stack will be better with traditional atomics. But something complex like a lockfree MPMC FIFO Queue will be appropriate for RTM. (an example MPMC FIFO requires two CAS ops, and an atomic load & store, even without contention; so you can replace all those atomic ops with one RTM section which either commits or aborts)

RTM handles nesting in the simplest way - nested transactions are absorbed into their parent. That is, no transaction commits until the topmost parent commits. Aborts in nested transactions will abort the parent.

BEWARE : the transactional code might not need the old fashioned lock-free atomics, but you do still have to be careful about what the optimizer does. Use volatiles, or perhaps relaxed-order atomic stores to make sure that the variable reads/writes you think are happening actually happen where you expect!!


I think an interesting point is that elided locks don't act like normal locks.

Consider a simple object protected by a lock :


struct Thingy
{
    int m_lock; // 0 if unlocked
    int m_a;
    int m_b;
};

The lock provides mutual exclusion for work done on m_a and m_b.

Let's pretend we protected it using a simple spin lock, so a function that used it would be like :


void DoStuff( Thingy * t )
{

while( AtomicExchange(t->m_lock,1) != 0 )
{
  // someone else has the lock; spin :
  pause;
}

// I own the lock
// do stuff to m_a and m_b in here :
DoStuff_Locked(t);

AtomicStore(t->m_lock,0); // release the lock

}

Now we replace it with the RTM transactional version :

void DoStuff( Thingy * t )
{

if ( _xbegin() )
{
  // transactional path

  // add "m_lock" to the transaction read-set :
  if ( t->m_lock != 0 )
  {
    // lock is held, abort transaction
    _xabort();
  }

    // do stuff to m_a and m_b in here :
    DoStuff_Locked(t);

  _xend();
}
else
{
  // transaction aborts go here
  // normal non-transactional path :

    while( AtomicExchange(t->m_lock,1) != 0 )
    {
      // someone else has the lock; spin :
      pause;
    }

    // I own the lock
    // do stuff to m_a and m_b in here :
    DoStuff_Locked(t);

    AtomicStore(t->m_lock,0); // release the lock
}

}

So, how does this work. Let's have a look :

In the transactional path, m_lock is not written to. The lock is not actually held. We do make sure to *read* m_lock so that if another thread takes the lock, it aborts our transaction. Our transaction will only complete if no other thread writes the memory we access.

In fact, the transactional path does not provide mutual exclusion. Multiple threads can read from the same object without conflicting. As lock as the "DoStuff_Locked" only reads, then the transactions will all proceed. In this sense, RTM has converted a normal lock into a read-writer lock!

The transactional path is fine grained! The way we wrote the code is coarse-grained. That is, m_lock protects the entire object, not the individual fields. So if thread 1 tries to modify m_a , and thread 2 modifies m_b, they must wait for each other, even though they are not actually racing. The transactional path will let them both run at the same time, provided there's cache line padding to prevent false sharing.

To be clear :


Transactional :

T1 :
read m_lock
write m_a

T2 :
read m_lock
write m_b

can run simultaneously with no conflict

traditional/fallback lock :

T1 :
take m_lock
write m_a

T2 :
take m_lock
write m_b

must synchronize against each other.

NOTE : as written the transactional path will actually fail all the time, because all the variables are on the same cache line. They need cache-line-size padding between the fields. Perhaps most importantly, the mutex/lock variable that's checked should be on its own cache line that is not written to on the transactional path.

Obviously this doesn't seem too interesting on this little toy object, but on large objects with many fields, it means that operations working on different parts of the object don't synchronize. You could have many threads reading from part of an object while another write writes a different part of the object with no conflict.

It's pretty interesting to me that the elided lock behaves very differently than the original lock. It changes to an RW lock and becomes fine-grained.


Overall I think TSX is pretty cool and I hope it becomes widespread. On the other hand, there is not much real world benefit to most code at the moment.

Some links :

Scaling Existing Lock-based Applications with Lock Elision - ACM Queue
Lock elision in glibc 01.org
TSX anti patterns in lock elision code Intel� Developer Zone
Exploring Intel� Transactional Synchronization Extensions with Intel� Software Development Emulator Intel� Developer Zone
Web Resources about Intel� Transactional Synchronization Extensions Intel� Developer Zone
Fun with Intel� Transactional Synchronization Extensions Intel� Developer Zone

1 comment:

Fabian Giesen said...

TSX would be even nicer if you could currently buy a CPU that supports it correctly. :) (Intel issued a microcode update for all shipping Haswell/Broadwell parts that disables TSX a few months back, because they found a bug.)

A while ago I spent some time trying to figure out what the x86 memory model actually means for the underlying implementation. Long story short, the main thing you seem to need for the x86 memory model that you don't need at that level of sophistication for weaker memory models is the ability to detect and rewind past "causality violations". OoO/speculation needs most of the machinery for this already, but if sync points are relatively rare, you can implement things like PPC "lwsync" / ARM "dmb" by actually stalling until all prior mem ops have completed. In x86, where every load is "acquire" and every store is "release", that's not viable. What it boils down to is that given an incoming L1 cache line invalidation, you need to know whether any pending load/store depended on that cache line's previous value. That probably boils down to a bunch of extra bits per line in the L1 cache, plus some extra metadata in the memory ordering buffer.

TSX can reuse the same infrastructure. "Begin" needs to checkpoint, and it needs to flip one big switch, which is that while you're in a transaction, no pending loads/stores may retire until the transaction is done (because loads/stores not being retired is what allows the rewinding in the first place). You could OoO that state too but it's probably a full memory fence / LS pipe flush (which would explain high overhead). Limiting factor for # of outstanding ops would be either reorder buffer, memory ordering buffer or L1D$, whichever gives out first. You grab (and keep) all cache lines you write to in E. Without TSX, you can make sure that you only physically modify the L1D$ lines once you know the store is gonna go through. With TSX, you speculatively update and don't know if it's gonna take until the corresponding store is retired. Thus you can *some* of the transaction updates applied to cache lines in E state without the transaction committed. The original contents are gone (you didn't snapshot them) which is why you need to bounce lines to I if that happens. If you do make it to the end, you need to wait until all stores have made it to L1, and once they do, you can batch-retire everything (=your commit). At this point, all the data is where it needs to be already; all that's left to do is update the "current instruction to retire" counter.

That's the high concept anyway. In practice there's definitely a lot more extra wrinkles that I'm missing.

old rants