These are some very good summaries, mainly by the MSDN guys :
MSDN - Lockless Programming Considerations for Xbox 360 and Microsoft Windows
MSDN - Concurrency What Every Dev Must Know About Multithreaded Apps
MSDN - Understanding Low-Lock Techniques in Multithreaded Apps
MSDN - Synchronization and Multiprocessor Issues (Windows)
Intel - Multiple Approaches to Multithreaded Applications
These articles have a good discussion of the memory models of various processors ; in particular Part 1 has a nice table that shows what each processor does (warning : there may be some slight wrongness in here about x86).
Here come a mess of links about memory models and barriers and what's going on. There's a lot of contradiction (!!) so be careful : For example - does LFENCE do anything at all on x86 for normal aligned integral reads? good question... btw SFENCE definitely does do nothing (I think?).
Bartosz Milewski - Publication Safety on Multicore
Hans Boehm - Why atomics have integrated ordering constraints
Bartosz Milewski - C Atomics and Memory Ordering
YALFAm (Yet Another Lock Free Approach, maybe) - comp.programming.threads Google Groups
Who ordered memory fences on an x86 � ��Bartosz Milewski�s Programming Cafe
The Double-Checked Locking is Broken Declaration
Synchronization and the Java Memory Model
Re Memory fence instructions on x86
Re LFENCE instruction (was [rfc][patch 33] x86 optimise barriers)
Re Intel x86 memory model question 2
Re Intel x86 memory model question 1
Memory Barriers, Compiler Optimizations, etc. - comp.programming.threads Google Groups
LFENCE instruction (was [rfc][patch 33] x86 optimise barriers)
cbrumme's WebLog Memory Model
And now some links about some real mad shit. Do not try any of this at home. I've found two mad geniuses : Chris Thomasson ("the AppCore guy") who has got some nutty x86 lock-free synchronization stuff that relies on details of x86 and does safe communication with zery fencing, and Dmitriy V'jukov ("the Relacey guy") who has written a ton of nutty awesome stuff for Intel.
Wait free queue - comp.programming.threads Google Groups
Thin Lock Implementation � ��Bartosz Milewski�s Programming Cafe
Paul E. McKenney Read-Copy Update (RCU)
Multithreaded Producer-Consumer The Easy Way
Larry Osterman's WebLog So you need a worker thread pool...
julian m bucknall Hazard Pointers (careful - wrong stuff in here)
Dr. Dobb's Lock-Free Queues July 1, 2008
Computer Laboratory - Cambridge Lock Free Library
CodeProject A simple Win32 readerswriters lock with reentrance. Free source code and programming help
Atomic Ptr Plus Project
AppCore A Portable High-Performance Thread Synchronization Library
Re Non-blocking queue for the 2 threads (AppCore)
refcount - mostly lock-free word-based atomic reference counting (AppCore)
Asymmetric rw_mutex with atomic-free fast-path for readers - Scalable Synchronization Algorithms Google Groups
Asymmetric Dekker Synchronizatio
There's also Intel's TBB :
It's a horrible name, it's not "Threading Building Blocks" at all. It's fucking OpenMP. It's not really a lib of little threading helpers, which would be handy, it's like fucking CUDA for CPUs.
Also just to repeat - I am in no way recommending that anybody go down this path. As I have said repeatedly - CRITICAL_SECTION is really fucking fast and smart and does exactly what you want. Just use it. If critical section locks are giving you problems it's because your architecture is wrong, not because the micro-routine is not fast enough.