05-26-09 - Automatic Multithreading

Bartosz made an interesting post about extending D for automatic multithreading with some type system additions.

It made me think about how you could do guaranteed-safe multithreading in C++. I think it's actually pretty simple.

First of all, every variable must be owned by a specific "lock". It can only be accessed if the current thread owns that lock. Many of the ownership relationships could be implicit. For example, there is an implicit lock for every thread, for stuff that is exclusively owned by that thread. That thread always holds that lock, so you never actually generate a lock/unlock, but conceptually those variables still have a single lock that owns them.

So, stack variables for example are implicitly automatically owned by the thread they are made on. Global variables are implicitly owned by the "main thread" lock if not otherwise specified. If some other thread tries to touch a global, it tries to take a lock that it doesn't own and you fail.


Lock gate1;

int shared : gate1; // specifies "shared" is accessed by gate1

int global; // implicitly owned by the main thread

void thread1()
{
    int x; // owned by thread1

    x = shared; // fail! you must lock gate1

    lock(gate1)
    {
        x = shared; // ok
    }

    x = global; // fail! you don't own global
}

Mkay, that's nice and all. Single-threaded programs still just work without any changes; everything is owned by the main thread. Another goal of threading syntax additions, I think, should be that going from a large single-threaded code base to adding a few threaded bits is easy. It is here.

There are a few things you would need to make this really work. One is a clean method of transferring ownership, as Bartosz talks about. Something like auto_ptr or unique_ptr, but actually enforced by the language, so that you can pass objects from one owner to another and ensure that no references leak out during the passage.

You can also of course extend this if you don't want the constraint that everything is protected by a lock. For example you could easily add "atomic" as a qualifier instead of a lock owner. If something is marked atomic, then it can be accessed without taking a lock, but it's only allowed to be accessed by atomic operations like cmpxchg/CAS.

This is a nice start, but it doesn't prevent deadlocks and still requires a lot of manual markup.

I also finally read a bit about Sun's Rock. It's very interesting, I encourage you to read more about it at the Sun Scalable Synchronization page.

Rock is actually a lot like LRB in many ways. It's 16 lightweight cores, each of which has 2 hardware threads (they call them strands). It's basically a simple in-order core, but it can do a sort of state-save push/pop and speculative execution. They have cleverly multi-purposed that functionality for both transactional memory and for improving the performance of single-threaded code. The state-save push/pop is a ghetto form of the out-of-order execution that we all know and love so much. It means that the chip can execute past a branch and then, if the branch resolves the other way, pop the state back to the branch. This is just like checkpointing a transaction and then failing the commit !

The key thing for the transactional memory is that Rock is NOT a full hardware transactional memory chip. It provides optimistic hardware transactions, with some well designed support to help software transactional memory implementations. The optimistic hardware transactions basically work by failing to commit if any other core touches a cache line you're writing. Thus if you do work in cache lines that you own, you can read data and write it out; it gets flushed out of the cache to the globally consistent view and there are no problems. If someone touches that cache line, it invalidates the transaction - even though it might not necessarily need to. That's what makes it optimistic and not fully correct (there are other things too). If it allows a transaction through, then it definitely was okay to do, but it can fail a transaction that didn't need to fail.

There are a lot of problems with that; it can fail in cases that are perfectly valid transactions, so obviously you cannot rely on it alone as a TM implementation. However, it does let a lot of simple transactions complete quickly. In particular, for simple transactions with no contention, the optimistic hardware transaction completes no problemo. If it fails, you need to have a fallback mechanism - in which case you should fall back to your real STM implementation, which should have forward progress guarantees. So one way of looking at the HTM in Rock is as a common-case optimization for your STM software. The commit in Rock has a "jump on fail" so that you can provide the failure handling code block; you could jump back to your checkpoint and try again, but eventually you have to do something else.

Perhaps more interestingly, the HTM in Rock is useful in other ways too. It gives you a way to do a more general LL-SC (load-linked/store-conditional) kind of thing, so even if you're not using STM, you can build up larger "atomic" operations for your traditional lock/atomic C++ style multithreading. It can also be used for "lock elision" - avoiding actually taking locks in traditional lock-based code. For example if you have code like :


    lock( CS )
    x = y;
    unlock( CS )


that can be transparently converted to something like :

    checkpoint; // begin hardware transaction attempt

    x = y;

    commit  // try to end hardware transaction; if it fails :
    failure {

        lock( CS )
        x = y;
        unlock( CS )
    }

So you avoid stalling if it's not needed. There are of course tons of scary subtleties with all this stuff. I'll let the Sun guys work it out.

It's also actually a more efficient way of doing the simple "Interlocked" type atomic ops. On x86, or in a strongly ordered language (such as Java's volatiles), the interlocked ops are fully sequentially consistent, which means they go in order against each other. That is actually a pretty expensive operation. We think of CMPXCHG or CAS as being very lightweight, but it has to sync the current core against all other cores to ensure the ops are ordered. (I wrote about some of this stuff before in more detail in the threading articles). The "transaction" hardware with a checkpoint/commit lets your core flow through its ops without doing a big sync on interlocked ops. Now, obviously the checkpoint/commit itself needs to be synchronized against all other cores, but it's done on a per-cache-line basis, and it uses the cache line dirtying communication hardware that we need anyway. In particular, in the case of no contention, or the case of doing multiple interlocked ops within one cache line, it's a big win.

I'm sure that Sun will completely cock things up somehow as they always do, but it is very interesting, and I imagine that most future chips will have models somewhat like this, as we all go forward and struggle with the issue of writing software for these massively parallel architectures.


Autodidactic Asphyxiation said...

Pending thread annotations in GCC:


The architecture folks realized pretty early on that transactions and speculation are basically the same thing. I think it is also fairly obvious that you don't need to do a fully hardware implementation of transaction memory, just like you don't need to do a fully hardware implementation of virtual memory.

I like the idea of a processor being able to dynamically choose to speculate or hyperthread, or just save power. It would be interesting to see what the "branch predictor" (or whatever would make that decision) would look like. I wonder if the additional complexity would be worthwhile. I suppose the problem is that to actually do speculation well you also need to go out-of-order, and things quickly get big and messy.

cbloom said...

"The architecture folks realized pretty early on that transactions and speculation are basically the same thing."

Yeah I didn't see it; it's cool.

"I like the idea of a processor being able to dynamically choose to speculate or hyperthread, or just save power."

Apparently Rock can be controlled from software to use both hardware hyperthreads to execute a single software thread. In that case one of the hyperthreads runs ahead and speculatively executes and prefetches stuff, while the other one hangs behind and retires what actually happened.

It's cool to be able to control stuff like that from the OS so you have control over power use, whether you target fewer or more threads, etc.

ryg said...

"Yeah I didn't see it; it's cool."
It's interesting how working on hardware, or even just thinking about how you would specify something if it was meant to be implemented as hardware, changes your perspective.

It changes your focus from "what am I trying to accomplish" to "what am I actually doing", which is very liberating sometimes. I remember reading some time ago about how LLVM used to have signed and unsigned integers as distinct types, which they later switched to just plain integer types, moving the signedness into the operations that actually care about it (division, comparison, and so on), which apparently solved several design problems for them.

This kind of peeling away of high-level information, where it no longer contributes anything, is just as tricky as getting the high-level info to where you need it.

nothings said...

Does your example of lock avoidance actually work? It seems like the hardware transaction needs to be blocked by the lock; i.e. you don't want to fail a transaction, grab the lock, do part of the locked code, then pause and another thread comes along and runs the transaction code and completes it successfully.

But I may not be understanding it correctly.

cbloom said...

I think the key to that is that the light transaction version still checks a bool in the lock. That does two things: if the bool is set, it's locked and you don't do the transaction; but it also makes you dependent on the cache line of the lock, so if anyone else changes the lock at the same time, you fail the transaction.

cbloom said...

So the transaction version is something like :


checkpoint

if ( lock.set )
    goto fallback

lock.set = true

.. do stuff ..

lock.set = false

commit
on failure goto fallback

So the commit only goes through when I was the only person who got their hands on the lock during that time.

Or something, I dunno, it's more complicated than that I'm sure.

cbloom said...

and of course to be clear on the context, this isn't a general transaction thing, it's just needed when you're trying to replace locks and/or use locks as the fallback mechanism.
