11/12/2014

11-12-14 - Intel TSX notes

Had a little look at Intel TSX. Some sloppy notes.

TSX is a set of new instructions for (limited) hardware transactions.

I'm gonna talk mainly about the newer RTM (XBEGIN/XEND) because it's conceptually clearer than XACQUIRE/XRELEASE.

In general RTM allows a chunk of speculative code to be run, which can be fully rolled back, and which is not visible to any other core until it is committed. This allows you to do a transaction involving several memory ops, and only commit if there was no contention, so the entire transaction appears atomically to other cores.

The TSX documentation talks about it a lot in terms of "hardware lock elision" (HLE) but it's more general than that. (and HLE the algorithmic process is not to be confused with the annoyingly named "HLE instructions" (XACQUIRE/XRELEASE))

TSX does not guarantee forward progress, so there must always be a fallback non-TSX pathway. (complex transactions might always abort, even without any contention, because they overflow the speculation buffer. Even transactions that could succeed in theory might livelock forever if you don't have the right pauses to allow forward progress, so the fallback path is needed then too).
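
As a concrete sketch of that idiom (my own toy code, not from Intel's docs; DoWork_Fallback stands in for whatever your non-TSX path is) : the abort status returned by _xbegin tells you whether a retry could possibly succeed :


#include <immintrin.h>

void DoWork( int max_retries )
{
    for(int i=0;i<max_retries;i++)
    {
        unsigned status = _xbegin();
        if ( status == _XBEGIN_STARTED )
        {
            // ... speculative work goes here ...
            _xend();  // commit
            return;
        }
        // aborted; only retry if the hardware says a retry might succeed :
        if ( ! (status & _XABORT_RETRY) )
            break;
        _mm_pause();
    }
    // must always be able to get here :
    DoWork_Fallback();
}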

TSX works by keeping a speculative set of registers and processor state. It tracks all reads done in the speculation block, and enqueues all writes to be delayed until the transaction ends. The memory tracking of the transaction is currently done using the L1 cache and the standard cache line protocols. This means contention is only detected at cache line granularity, so you have the standard "false sharing" issue.

If your transaction reads a cache line, then any write to that cache line by another core causes the transaction to abort. (reads by other cores do not cause an abort).

If your transaction writes a cache line, then any read or write by another core causes the transaction to abort.

If your transaction aborts, then any cache lines written are evicted from L1. If any of the cache lines involved in the transaction are evicted during the transaction (eg. if you touch too much memory, or another core locks that line), the transaction is aborted.

TSX seems to allow quite a large working set (up to the size of L1 ?). Obviously the more memory you touch, the more likely you are to abort due to contention.

Obviously you will get aborts from anything "funny" that's not just plain code and memory access. Context switches, IO, kernel calls, etc. will abort transactions.

At the moment, TSX is quite slow, even if there's no contention and you don't do anything in the block. There's a lot of overhead. Using TSX naively may slow down even threaded code. Getting significant performance gains from it is non-trivial.

RTM Memory Ordering :

A successful RTM commit causes all memory operations in the RTM region to appear to execute atomically. A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction. The XBEGIN instruction does not have fencing semantics. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and never made visible to any other logical processor.

One of the best resources is the new Intel Optimization Manual, which has a whole chapter on TSX.

RTM is very nice as a replacement for traditional lockfree algorithms based on atomics when those algorithms are very complex. Something simple like just an atomic increment, you obviously shouldn't use RTM, just do the atomic. Even something like a lockfree LIFO Stack will be better with traditional atomics. But something complex like a lockfree MPMC FIFO Queue will be appropriate for RTM. (an example MPMC FIFO requires two CAS ops, and an atomic load & store, even without contention; so you can replace all those atomic ops with one RTM section which either commits or aborts)
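
To make that concrete, here's a toy sketch (mine, not production code) of a bounded FIFO where the whole push is a single transaction, falling back to a spin lock on abort :


#include <intrin.h>
#include <immintrin.h>

#define QSIZE 256 // power of two

struct Fifo
{
    unsigned head;          // next slot to pop
    unsigned tail;          // next slot to push
    volatile long lock;     // fallback spin lock; 0 = unlocked
    int slots[QSIZE];
};

// returns 0 if the queue was full
int Fifo_Push( Fifo * q, int value )
{
    if ( _xbegin() == _XBEGIN_STARTED )
    {
        if ( q->lock != 0 ) _xabort(0xff);  // someone is in the fallback path
        if ( q->tail - q->head == QSIZE ) { _xend(); return 0; } // full
        q->slots[q->tail % QSIZE] = value;
        q->tail++;
        _xend(); // head, tail & the slot all commit atomically
        return 1;
    }

    // fallback : ordinary spin lock
    while( _InterlockedExchange(&q->lock,1) != 0 )
        _mm_pause();
    int ok = 0;
    if ( q->tail - q->head != QSIZE )
    {
        q->slots[q->tail % QSIZE] = value;
        q->tail++;
        ok = 1;
    }
    q->lock = 0; // release (a plain volatile store is a release on x86)
    return ok;
}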

RTM handles nesting in the simplest way - nested transactions are absorbed into their parent. That is, no transaction commits until the topmost parent commits. Aborts in nested transactions will abort the parent.
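
eg. a sketch (fallback paths elided for brevity) :


void Inner()
{
    if ( _xbegin() == _XBEGIN_STARTED )
    {
        // ... work ...
        _xend(); // if we were called inside Outer's transaction,
                 //   this does NOT commit, it just closes the nested region
    }
    else { /* fallback */ }
}

void Outer()
{
    if ( _xbegin() == _XBEGIN_STARTED )
    {
        Inner(); // Inner's transaction is absorbed into this one
        _xend(); // everything commits here, atomically
        // note : an abort anywhere inside rolls all the way back
        //   to this outermost _xbegin
    }
    else { /* fallback */ }
}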

BEWARE : the transactional code might not need the old-fashioned lock-free atomics, but you do still have to be careful about what the optimizer does. Use volatiles, or perhaps relaxed-order atomic stores, to make sure that the variable reads/writes you think are happening actually happen where you expect!!
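
One way to do that is C++11 relaxed atomics, which compile to plain loads and stores on x86 but which the optimizer can't cache or eliminate. A sketch (the variable names are mine) :


#include <atomic>
#include <immintrin.h>

std::atomic<int> g_a(0), g_b(0);

void MoveOne()
{
    if ( _xbegin() == _XBEGIN_STARTED )
    {
        // relaxed ops are plain loads & stores on x86,
        //  but the compiler must actually perform them here :
        g_a.store( g_a.load(std::memory_order_relaxed) - 1, std::memory_order_relaxed );
        g_b.store( g_b.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed );
        _xend();
    }
    else
    {
        // non-TSX fallback path goes here
    }
}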


I think an interesting point is that elided locks don't act like normal locks.

Consider a simple object protected by a lock :


struct Thingy
{
    volatile long m_lock; // 0 if unlocked
    int m_a;
    int m_b;
};

The lock provides mutual exclusion for work done on m_a and m_b.

Let's pretend we protect it with a simple spin lock, so a function that uses it would look like :


#include <intrin.h>

void DoStuff( Thingy * t )
{

while( _InterlockedExchange(&t->m_lock,1) != 0 )
{
  // someone else has the lock; spin :
  _mm_pause();
}
}

// I own the lock
// do stuff to m_a and m_b in here :
DoStuff_Locked(t);

t->m_lock = 0; // release the lock (a plain volatile store is a release on x86)

}

Now we replace it with the RTM transactional version :

#include <immintrin.h>

void DoStuff( Thingy * t )
{

if ( _xbegin() == _XBEGIN_STARTED ) // must compare to _XBEGIN_STARTED; an explicit abort returns a nonzero status
{
  // transactional path

  // add "m_lock" to the transaction read-set :
  if ( t->m_lock != 0 )
  {
    // lock is held, abort transaction
    _xabort(0xff); // _xabort takes a constant imm8, which comes back in the abort status
  }

    // do stuff to m_a and m_b in here :
    DoStuff_Locked(t);

  _xend();
}
else
{
  // transaction aborts go here
  // normal non-transactional path :

    while( _InterlockedExchange(&t->m_lock,1) != 0 )
    {
      // someone else has the lock; spin :
      _mm_pause();
    }
    }

    // I own the lock
    // do stuff to m_a and m_b in here :
    DoStuff_Locked(t);

    t->m_lock = 0; // release the lock
}

}

So, how does this work? Let's have a look :

In the transactional path, m_lock is not written to. The lock is not actually held. We do make sure to *read* m_lock so that if another thread takes the lock, it aborts our transaction. Our transaction will only complete if no other thread writes the memory we access.

In fact, the transactional path does not provide mutual exclusion. Multiple threads can read from the same object without conflicting. As long as DoStuff_Locked only reads, the transactions will all proceed. In this sense, RTM has converted a normal lock into a reader-writer lock!

The transactional path is fine-grained! The way we wrote the code, the locking is coarse-grained : m_lock protects the entire object, not the individual fields. So if thread 1 tries to modify m_a and thread 2 modifies m_b, they must wait for each other, even though they are not actually racing. The transactional path will let them both run at the same time, provided there's cache line padding to prevent false sharing.

To be clear :


Transactional :

T1 :
read m_lock
write m_a

T2 :
read m_lock
write m_b

can run simultaneously with no conflict

traditional/fallback lock :

T1 :
take m_lock
write m_a

T2 :
take m_lock
write m_b

must synchronize against each other.

NOTE : as written, the transactional path will actually abort all the time, because all the fields are on the same cache line. They need cache-line-size padding between them. Perhaps most importantly, the mutex/lock variable that's checked should be on its own cache line that is not written to on the transactional path.
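
eg. a padded version of Thingy (64 being the cache line size on current Intel chips) :


struct Thingy
{
    __declspec(align(64)) volatile long m_lock; // read, but never written, on the transactional path
    __declspec(align(64)) int m_a;
    __declspec(align(64)) int m_b;
};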

Obviously this doesn't seem too interesting on this little toy object, but on large objects with many fields, it means that operations working on different parts of the object don't synchronize. You could have many threads reading from part of an object while another thread writes a different part of the object, with no conflict.

It's pretty interesting to me that the elided lock behaves very differently than the original lock. It changes to an RW lock and becomes fine-grained.


Overall I think TSX is pretty cool and I hope it becomes widespread. On the other hand, there is not much real world benefit to most code at the moment.

Some links :

Scaling Existing Lock-based Applications with Lock Elision - ACM Queue
Lock elision in glibc - 01.org
TSX anti-patterns in lock elision code - Intel® Developer Zone
Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator - Intel® Developer Zone
Web Resources about Intel® Transactional Synchronization Extensions - Intel® Developer Zone
Fun with Intel® Transactional Synchronization Extensions - Intel® Developer Zone

11/11/2014

11-11-14 - x64 movdqa atomic test

How to do an atomic load/store of a 128 bit value on x64 is an ugly question.

The only guaranteed way is via cmpxchg16b. But that's an awfully slow way to do a load or store.

movdqa appears to be an atomic way to move 128 bits - on most chips. Not all. And Intel/AMD don't want to clearly identify the cases where it is atomic or not. (they specifically don't guarantee it)

At the moment, shipping code needs to use cmpx16 to be safe. (my tests indicate that the x64 chips in the modern game consoles *do* have atomic movdqa, so it seems safe to use there)

My main curiosity is whether there exist any modern ("Core 2" or newer) x64 chips that *do not* provide an atomic movdqa.
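
For reference, the movdqa path being tested is just the aligned 128-bit SSE2 load/store (my sketch; again, this is NOT guaranteed atomic by Intel or AMD, and movdqa faults if the shared location is not 16-byte aligned) :


#include <emmintrin.h>

void Movdqa_Load128(__int64 * out_Result, const __int64 * in_LoadFrom)
{
    __m128i v = _mm_load_si128((const __m128i *)in_LoadFrom); // movdqa load of the shared location
    _mm_storeu_si128((__m128i *)out_Result, v);
}

void Movdqa_Store128(__int64 * out_StoreTo, const __int64 * in_Value)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in_Value);
    _mm_store_si128((__m128i *)out_StoreTo, v); // movdqa store to the shared location
}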

Anyhoo, here's a test to check if movdqa is atomic on your machine. If you like, run it and send me the results : (Windows, x64 only)

cb_x64_atomics_test.7z

The atomic test will just run for a while. If it has a failure it will break.

(you will get some errors about not having a v: or r: drive; you can just ignore them.)

Copy the output, or you should be able to get the log file here :

"c:\oodlelogs\oodle_cb_test_x64_atomics.log"
"c:\oodlelogs\oodle_cb_test_x64_atomics.prev"

email me at cb at my domain.


For the record, an atomic 128 bit load/store for x64 Windows using cmpx16 :


#include <intrin.h>

void AtomicLoad128(__int64 * out_Result, __int64 volatile * in_LoadFrom)
{
    // in_LoadFrom must be 16-byte aligned (cmpxchg16b faults otherwise)
    // do a swap of the value in out_Result to itself
    //  if it matches, it stays the same
    //  if it doesn't match, we get a load
    _InterlockedCompareExchange128(in_LoadFrom,out_Result[1],out_Result[0],out_Result);
}

void AtomicStore128(__int64 volatile * out_StoreTo,const __int64 * in_Value)
{
    // out_StoreTo must be 16-byte aligned (cmpxchg16b faults otherwise)
    // do an initial non-atomic load of StoreTo :
    __int64 check_StoreTo[2];
    check_StoreTo[0] = out_StoreTo[0];
    check_StoreTo[1] = out_StoreTo[1];
    // store with cmpx16 :
    while( ! _InterlockedCompareExchange128(out_StoreTo,in_Value[1],in_Value[0],check_StoreTo) )
    {
        // check_StoreTo was reloaded with the value in out_StoreTo
        _mm_pause();
    }
}
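
Usage sketch : the only gotcha is that the shared location must be 16-byte aligned, or cmpxchg16b will fault :


__declspec(align(16)) __int64 g_shared[2];

void Example()
{
    __int64 v[2] = { 1, 2 };
    AtomicStore128(g_shared, v);

    __int64 r[2] = { 0, 0 }; // any initial guess is fine
    AtomicLoad128(r, g_shared);
    // r now holds { 1, 2 }
}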