02-23-09 - Low Level Threading - Annotated Linkies

Part 1 : Background

Stuff about C++0x. I now believe that C++0x is the right way to talk about multi-threading. You require the basic type to be atomic (eg. fit in a machine word), and then you add specific memory order constraints to each operation - load, store, or exchange - that constrain the program to be correct. The goal is writing good lock-free code is to find the minimum necessary and sufficient memory constraints to make the program correct. The lighter the constraints that you need to apply, the more efficient it can be.

Multicores and Publication Safety � ��Bartosz Milewski�s Programming Cafe
C++ atomics and memory ordering � ��Bartosz Milewski�s Programming Cafe
Hans Boehm : Why atomics have integrated ordering constraints

Stuff on the basics of "volatile" and memory ordering and how to make the compiler do what you want. The summary is that "volatile" has changed meaning so much that you should probably just not use it at all. The main issues to worry about are : compiler reordering and what the compiler does with your code, memory ordering constraints (cache/bus locking and flushing constraints), CPU reordering of instructions and memory timing vs cpu instruction timing.

It Goes To Eleven The Joys of Compiler and Processor Reordering Why You Technically Need Read-Side Barriers
Kang Su's Blog volatile, acquirerelease, memory fences, and VC2005
Intel - Volatile : Almost Useless for Multi-Threaded Programming Memory ordering, barriers, acquire and release semantics niallryan.com
The Old New Thing Acquire and release sound like bass fishing terms, but they also apply to memory models
Dr. Dobb's volatile vs. volatile January 8, 2009
Acquire and Release Semantics
Memory barrier - Wikipedia, the free encyclopedia
Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8 Visual C++ Language Visual C++ MSDN Forums

The MSDN stuff is all good :

Understanding Low-Lock Techniques in Multithreaded Apps
Rules and Limitations for TLS
Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Synchronization and Multiprocessor Issues (Windows)
Concurrency What Every Dev Must Know About Multithreaded Apps
Break Free of Code Deadlocks in Critical Sections Under Windows
Under the Hood Reduce EXE and DLL Size with LIBCTINY.LIB
Multiple Approaches to Multithreaded Applications - Intel� Software Network

I think that "cmpxchg" on x86 with no lock prefix is equivalent to a C++0x compare_exchange(memory_order_acq_rel) , as opposed to "lock cmpxchg" which is memory_order_seq_cst :

Boehm, Hans - locks on X86
x86 synchronization primitives

This next link I believe holds the secrets to the mysteries of the x86 load reordering. All loads on the x86 are "acquire", which means they appear to go in order. (an "acquire" load means that later loads cannot move before the acquire, but earlier ones could move after it - when all loads are acquire it means they all go in order). However, we all know the x86 can speculatively execute loads out of order. WTF !? Something is seriously wrong with those two facts. In fact, loads are executed out of order, but if writes cause any changes during that time, they are invalidated and the loads are reissued :

details of x86 load reorder

This post gives you some idea of what the processors have to do to enforce the appearance of in-order loads. Also note : "Any memory access that falls within a single cache line is guaranteed to be atomic". (x86 specific)

Gustavo Duarte talks about Cache details

General note : I find it very valuable to understand what is going on under the hood. HOWEVER you should not write code that relies on the details of one processor or OS or compiler implementation. The code should be descriptive about the constraints that it needs, then you can have library routines which enforce that model for the specific OS/compiler/CPU.

Some general articles about how lock-free and CAS are not a panacea. It's important to remember that a CAS (any "lock" instruction) is a bus lock which is very slow - not just for your processor, but for all other processors looking at that cache line! In even moderately contended algorithms, the cost of CAS for N processors goes up like O(N). The means if you tried to implement a huge multiprocessor machine where people used Interlocked ops to synchronize, as you added more and more processors eventually the whole thing would be running in lockstep with every processor waiting on the bus all the time. In fact, a CAS-spin-retry is really fundamentally not much different than a short Critical Section.

Joe Duffy's Weblog - CAS Performance kills you
Dr. Dobb's Maximize Locality, Minimize Contention May 23, 2008
Dr. Dobb's Lock-Free Code A False Sense of Security September 8, 2008
What is the real cost of LOCK on x86 ?

Good article about how Windows SLIST deals with 64-bit pointers and the requirement to do CAS with only a 64-bit atomic in early x64 chips. With 32-bit pointers we often make a {pointer,counter} object in 64 bits and use cmpxchg8b. New chips have cmpxchg16b. Sadly there was a period when Intel made 64-bit chips with no cmpxchg16b.

Alex Ionescu�s Blog � Blog Archive � Behind Windows x64�s 44-bit Virtual Memory Addressing Limit

A bunch of Herb's junk : Herb develops a lock free queue (FIFO) which is basically just a SPSC queue, and then makes an MPMC variant by putting a lock on each end. This is a general technique - any SP or SC structure can be changed to MP or MC by adding a "producer lock" or "consumer lock". SP or SC work by knowing that only one owner is touching that end at a time. That "owner" can either be a specific thread, or it can be designated as the holder of the lock.

Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007
Dr. Dobb's Lock-Free Queues July 1, 2008
Dr. Dobb's Writing Lock-Free Code A Corrected Queue September 29, 2008
Dr. Dobb's Measuring Parallel Performance Optimizing a Concurrent Queue December 1, 2008

Big collections of articles :

Dr. Dobb's Go Parallel Features
Computer Laboratory -
AMD Developer Central - Articles & Whitepapers
Sun Scalable Synchronization Research Group
Sun Labs Scalable Synchronization Research Group Publications

Worker pools and queues :

Multithreaded Producer-Consumer The Easy Way
Larry Osterman's WebLog So you need a worker thread pool...
Re Non-blocking queue for the 2 threads
Joe Duffy's WSQ 1
Joe Duffy's WSQ 2
Joe Duffy's WSQ 3
Creating a thread safe producer consumer queue in C++ without using locks - Cluebat-man to the rescue

Some shit about singletons and such :

Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004
Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN
Double-checked locking - Wikipedia, the free encyclopedia
The Old New Thing C++ scoped static initialization is not thread-safe, on purpose!
The Old New Thing An auto-reset event is just a stupid semaphore

Major lock-free code bases :

Paul E. McKenney Read-Copy Update (RCU)
Atomic Ptr Plus Project
TBB Home
Download details Parallel Extensions to .NET Framework 3.5 June 2008 CTP
Getting Started with the .NET Task Parallel Library Multi-Core Case Studies
Intel� Performance Libraries - Intel� Software Network
Intel� Thread Checker 3.1 for Windows - Intel� Software Network
Intel� Thread Profiler 3.1 for Windows - Intel� Software Network
Synchronization Algorithm Verificator for C++0x - Intel� Software Network
transforming single-threaded allocators... - Intel� Software Network

Some specific code samples (WARNING : OFTEN WRONG !! I have yet to find any reliable source of lock-free code on the net that you can just grab and use)

opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)
Kernel space Ticket spinlocks - LinuxWorld
CodeProject A simple Win32 readerswriters lock with reentrance. Free source code and programming help
CodeProject Single Producer Single Consumer, with no critical sections. Free source code and programming help
Lock-Free Linked List
SourceForge.net Spin-X Engine
Lock free double ended queue .mischief.mayhem.soap.
Single producersingle consumer bounded queue .mischief.mayhem.soap.
Lock-free, multi-threaded Mandelbrot .mischief.mayhem.soap.

ADDENDUM : Found some more goodies : Java Memory Model guide for compiler writers by Doug Lea. This is a very good overall discussion of memory models and how to implement them on various architectures.

Example implementation of C++0x on Power architecture by Paul McKenney.

fences in C++0x by Peter Dimov.

Joe Duffy on why critsecs need fences

Bartosz on the rare need for fences on x86

Part 2 : V'jukov and Thomasson

Now we get into the part that's actually useful. (note the are places where they "sketch" algorithms - obviously don't try to grab a sketch code snippet and stick it in your code and assume it works)

Any lock-free code that has not been tested with Relacy is basically worthless. Anybody can write lock-free code and it "seems to work" - it's shit, it's garbage, it's no help at all. The hard part is in the testing and absolutely being sure it's correct, just sketching up something that "looks right" can be done by any monkey.

Relacy Race Detector Google Groups

Dmitriy V'jukov who wrote Relacy also has lots of great posts around the net. The most valuable information we will find is by him and one other guy :

Chris Thomasson, author of AppCore : the code in AppCore is mostly ASM and x86 specific. Chris has lots of great writing on lock-free stuff around the net, some more links to him will follow. He believes that doing a lot of this in ASM is easier because you can see the exact ordering of ops and don't have to worry about what kind of madness the compiler is doing. I think he's right to some extent, I wind up spending half my time trying to ensure the compiler generates the right single instruction. For example his SPSC queue is incredibly simple and efficient.

AppCore A Portable High-Performance Thread Synchronization Library
vzoom refcount - mostly lock-free word-based atomic reference counting

Our two heroes often reside here :

Scalable Synchronization Algorithms Google Groups

Discussion with Thomasson where he describes how the Fober LIFO is broken and how to fix it : (also contains his x86 ASM for a LF LIFO and small mention of the fact that stack can touch freed memory, but in a benign way, so that just catching the access violation and making it a retry is a good fix) :

Solving the memory issue of lock-free stack

Thomasson sketch of work-requesting system : (unlike work-stealing , work-requesting lets you used a non-thread-safe stack for your own work items) :

VERY crude work-requesting environment... - comp.programming.threads Google Groups

Thomasson and V'jukov discuss the SPMC WSDQ of Hendler & Lev :

ABP stealing deque algorithm - comp.programming.threads Google Groups

Thomasson sketches a LF SPMC WSDQ (ignore all the junk posts at the top and scroll down to Thomasson and V'jukov) :

Single producer, one-shot work queue implementation, no lock on queue. - comp.programming.threads Google Groups

In this post Thomasson sketches the method of using an MPMC stack and just exchaning the head with NULL to grab the whole thing, then reversing it to make it FIFO instead of LIFO. This is a very efficient and simple way to do an MPSC FIFO. Also more interesting discussion by V'jukov at the bottom about garbage collection and such :

Wait free queue - comp.programming.threads Google Groups

A related thread - this starts off with the MPSC method using stack-swap, but then V'jukov drops a gem of a new MPSC method :

A Correct, Lock-free FIFO with a single CAS per complete putget cycle - Scalable Synchronization Algorithms Google Groups
V'jukov intrusive MPSC node FIFO here

Another WSDQ where Thomasson and V'jukov try to straighten things out :

Double ended queue - review - comp.programming.threads Google Groups

Again discussion starts off with mystification about the x86 details, and leads to Thomasson posting his SPSC and straightening out the clowns :

Memory barrier + volatile (C++) - comp.programming.threads Google Groups

V'jukov's very cool node allocator for FIFO queues that does no atomic ops at all ! (amortized). FIFO's are a very bad case for standard multi-threaded allocators like tcmalloc, because every single block is being allocated on one thread and freed on another, which requires them to do a cross-thread pass-back (or causes heavy slab contention).

Lock-free node allocator for fifo-queues - Scalable Synchronization Algorithms Google Groups

V'jukov's simple example of an SPSC queue :

Single-ProducerSingle-Consumer Queue - Intel� Software Network

V'jukov talks about CAS performance and issues of MPMC and other structures, mentions SList exception catching technique : (go to page 2)

Concurrent stack - Intel� Software Network

Finally : Thomasson's slab allocator

Dmitriy sometimes uses the opposite meaning of "head" and "tail" from how I think and it tilts the hell out of me. I always thing of "head" as where you push, and you pop from the tail. So like stacks only have a head, you push & pop the head, FIFOs you enqueue on the head and dequeue on the tail, he usually does the opposite. That is, in a singly-list linked I think of links as pointing from the head towards the tail, so if you walk links you get from the head to the tail.

ADDENDUM : Found some more goodies :

Thomasson describes vZOOM architecture

V'jukov Memory Passing Queue good for thread caching allocators passing blocks back to the main thread for example, but also other things.


Brian said...

Are there any lock-free implementations of multi-producer or multi-consumer queues?

dfan said...

No way, man. The head is where you get stuff.

If you think of any chronologically ordered set of things, isn't the tail the back end of it, the farthest from being ready?

When you get in line, do you get in at the head of the line or the tail of the line??

cbloom said...

Yeah the line analogy certainly goes well with "head" being where you pop things.

BTW stacks also confuse me. Some people talk about popping the bottom of the stack. I guess they're thinking about the say stacks are usually done in memory with the bottom growing downward. I usually think of stacks as a literal physical "stack" like a deck of cards, so you push & pop the top of the stack, and the bottom is rarely seen.

cbloom said...

"Are there any lock-free implementations of multi-producer or multi-consumer queues?"

Next post. Or maybe the one after that ;)

cbloom said...

Though the links I posted contained some MP and MC algorithms if you're brave. Note that you almost always want MPSC or SPMC not MPMC , and I'll elaborate more in the next post.

Brian said...

Hmm. I wonder if in the MPSC case if you do better with 1 MPSC queue or a bunch of SPSC queues and a round robin consumer strategy?

dfan said...

Well, I (and all right-thinking people) do agree with you about stacks; the head (where you push and pop) is the top of the stack. I think every time I saw the concept of a stack introduced, they explained it in terms of a stack of plates.

old rants