2/23/2009

02-23-09 - Low Level Threading - Annotated Linkies

Part 1 : Background

Stuff about C++0x. I now believe that C++0x is the right way to talk about multi-threading. You require the basic type to be atomic (e.g. fit in a machine word), and then you add specific memory order constraints to each operation - load, store, or exchange - that constrain the program to be correct. The goal in writing good lock-free code is to find the minimum necessary and sufficient memory constraints to make the program correct. The lighter the constraints that you need to apply, the more efficient it can be.
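A minimal sketch of what that looks like in the C++0x interface (variable names here are mine, just for illustration) : the writer publishes a payload with a release store, and the reader's acquire load is the minimum constraint that guarantees it sees the payload :

```cpp
#include <atomic>

// sketch of C++0x-style publication : plain stores to the payload,
// then a release store to publish; the matching acquire load on the
// reader side is the minimum ordering constraint needed.
std::atomic<int*> g_published(nullptr);
int g_payload = 0;

void publish()
{
    g_payload = 42; // plain store
    g_published.store(&g_payload, std::memory_order_release);
}

int * try_consume()
{
    // pairs with the release above : if we see the pointer,
    // we are guaranteed to also see g_payload == 42
    return g_published.load(std::memory_order_acquire);
}
```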

Multicores and Publication Safety - Bartosz Milewski's Programming Cafe
C++ atomics and memory ordering - Bartosz Milewski's Programming Cafe
Hans Boehm : Why atomics have integrated ordering constraints

Stuff on the basics of "volatile" and memory ordering and how to make the compiler do what you want. The summary is that "volatile" has changed meaning so much that you should probably just not use it at all. The main issues to worry about are : compiler reordering and what the compiler does with your code, memory ordering constraints (cache/bus locking and flushing constraints), CPU reordering of instructions and memory timing vs cpu instruction timing.

It Goes To Eleven - The Joys of Compiler and Processor Reordering : Why You Technically Need Read-Side Barriers
Kang Su's Blog - volatile, acquire/release, memory fences, and VC2005
Intel - Volatile : Almost Useless for Multi-Threaded Programming
Memory ordering, barriers, acquire and release semantics - niallryan.com
The Old New Thing Acquire and release sound like bass fishing terms, but they also apply to memory models
Dr. Dobb's volatile vs. volatile January 8, 2009
Acquire and Release Semantics
Memory barrier - Wikipedia, the free encyclopedia
Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8 Visual C++ Language Visual C++ MSDN Forums

The MSDN stuff is all good :

Understanding Low-Lock Techniques in Multithreaded Apps
Rules and Limitations for TLS
Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Synchronization and Multiprocessor Issues (Windows)
Concurrency What Every Dev Must Know About Multithreaded Apps
Break Free of Code Deadlocks in Critical Sections Under Windows
Under the Hood Reduce EXE and DLL Size with LIBCTINY.LIB
Multiple Approaches to Multithreaded Applications - Intel® Software Network

I think that "cmpxchg" on x86 with no lock prefix is equivalent to a C++0x compare_exchange(memory_order_acq_rel) , as opposed to "lock cmpxchg" which is memory_order_seq_cst :

Boehm, Hans - locks on X86
x86 synchronization primitives

This next link I believe holds the secrets to the mysteries of the x86 load reordering. All loads on the x86 are "acquire", which means they appear to go in order. (an "acquire" load means that later loads cannot move before the acquire, but earlier ones could move after it - when all loads are acquire it means they all go in order). However, we all know the x86 can speculatively execute loads out of order. WTF !? Something is seriously wrong with those two facts. In fact, loads are executed out of order, but if writes cause any changes during that time, they are invalidated and the loads are reissued :

details of x86 load reorder

This post gives you some idea of what the processors have to do to enforce the appearance of in-order loads. Also note : "Any memory access that falls within a single cache line is guaranteed to be atomic". (x86 specific)

Gustavo Duarte talks about Cache details

General note : I find it very valuable to understand what is going on under the hood. HOWEVER you should not write code that relies on the details of one processor or OS or compiler implementation. The code should be descriptive about the constraints that it needs, then you can have library routines which enforce that model for the specific OS/compiler/CPU.

Some general articles about how lock-free and CAS are not a panacea. It's important to remember that a CAS (any "lock" instruction) is a bus lock which is very slow - not just for your processor, but for all other processors looking at that cache line! In even moderately contended algorithms, the cost of CAS for N processors goes up like O(N). This means if you tried to implement a huge multiprocessor machine where people used Interlocked ops to synchronize, as you added more and more processors eventually the whole thing would be running in lockstep with every processor waiting on the bus all the time. In fact, a CAS-spin-retry is really fundamentally not much different than a short Critical Section.
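To make the comparison concrete, here's my own little illustration (not from the linked articles) of a CAS-spin increment. Note that every failed compare_exchange is itself a bus-locked operation, so under contention this behaves a lot like spinning on a tiny critical section :

```cpp
#include <atomic>

// illustrative sketch : a "lock-free" CAS-spin increment. Each failed
// compare_exchange is a full bus-locked op, which is why contended
// CAS loops are not much cheaper than a short critical section.
std::atomic<int> g_counter(0);

void lockfree_increment()
{
    int old = g_counter.load(std::memory_order_relaxed);
    // on failure, compare_exchange_weak reloads 'old' for us; retry :
    while ( ! g_counter.compare_exchange_weak(old, old + 1) )
    {
    }
}
```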

Joe Duffy's Weblog - CAS Performance kills you
Dr. Dobb's Maximize Locality, Minimize Contention May 23, 2008
Dr. Dobb's Lock-Free Code A False Sense of Security September 8, 2008
What is the real cost of LOCK on x86 ?

Good article about how Windows SLIST deals with 64-bit pointers and the requirement to do CAS with only a 64-bit atomic in early x64 chips. With 32-bit pointers we often make a {pointer,counter} object in 64 bits and use cmpxchg8b. New chips have cmpxchg16b. Sadly there was a period when Intel made 64-bit chips with no cmpxchg16b.
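A sketch of the {pointer,counter} idea with 32-bit pointer values packed into one 64-bit atomic (all the names here are mine; the 64-bit-pointer analog is what needs cmpxchg16b) :

```cpp
#include <atomic>
#include <cstdint>

// {pointer,counter} packed into one 64-bit atomic : the counter is
// bumped on every successful swap, which defeats ABA on the pointer.
struct PtrCount
{
    uint32_t ptr;   // really a 32-bit pointer value
    uint32_t count; // version counter
};

std::atomic<uint64_t> g_head(0);

uint64_t pack(PtrCount pc)
{
    return ( (uint64_t)pc.count << 32 ) | pc.ptr;
}

PtrCount unpack(uint64_t v)
{
    PtrCount pc;
    pc.ptr   = (uint32_t)(v & 0xFFFFFFFFu);
    pc.count = (uint32_t)(v >> 32);
    return pc;
}

// swap in a new pointer only if nobody else touched the head since
// we saw oldPtr; on success the counter advances.
bool try_set(uint32_t oldPtr, uint32_t newPtr)
{
    uint64_t cur = g_head.load();
    PtrCount pc = unpack(cur);
    if ( pc.ptr != oldPtr ) return false;
    PtrCount next = { newPtr, pc.count + 1 };
    return g_head.compare_exchange_strong(cur, pack(next));
}
```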

Alex Ionescu's Blog » Blog Archive » Behind Windows x64's 44-bit Virtual Memory Addressing Limit

A bunch of Herb's junk : Herb develops a lock free queue (FIFO) which is basically just a SPSC queue, and then makes an MPMC variant by putting a lock on each end. This is a general technique - any SP or SC structure can be changed to MP or MC by adding a "producer lock" or "consumer lock". SP or SC work by knowing that only one owner is touching that end at a time. That "owner" can either be a specific thread, or it can be designated as the holder of the lock.
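Here's my own sketch of that general technique : a bounded SPSC ring buffer, made MPMC by adding a producer lock and a consumer lock. The lock holder becomes the single "owner" of that end, which is exactly the SPSC contract :

```cpp
#include <atomic>
#include <mutex>

// a bounded single-producer/single-consumer ring buffer
template <typename T, int N>
class SpscRing
{
public:
    SpscRing() : m_read(0), m_write(0) { }
    bool push(const T & item) // single producer only
    {
        int w = m_write.load(std::memory_order_relaxed);
        int next = (w + 1) % N;
        if ( next == m_read.load(std::memory_order_acquire) )
            return false; // full
        m_data[w] = item;
        m_write.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T * out) // single consumer only
    {
        int r = m_read.load(std::memory_order_relaxed);
        if ( r == m_write.load(std::memory_order_acquire) )
            return false; // empty
        *out = m_data[r];
        m_read.store((r + 1) % N, std::memory_order_release);
        return true;
    }
private:
    T m_data[N];
    std::atomic<int> m_read;
    std::atomic<int> m_write;
};

// the MPMC variant per the text : take the producer lock to push,
// the consumer lock to pop; the lock holder is the designated owner.
template <typename T, int N>
class MpmcRing
{
public:
    bool push(const T & item)
    {
        std::lock_guard<std::mutex> lock(m_producerLock);
        return m_ring.push(item);
    }
    bool pop(T * out)
    {
        std::lock_guard<std::mutex> lock(m_consumerLock);
        return m_ring.pop(out);
    }
private:
    std::mutex m_producerLock;
    std::mutex m_consumerLock;
    SpscRing<T,N> m_ring;
};
```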

Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007
Dr. Dobb's Lock-Free Queues July 1, 2008
Dr. Dobb's Writing Lock-Free Code A Corrected Queue September 29, 2008
Dr. Dobb's Measuring Parallel Performance Optimizing a Concurrent Queue December 1, 2008

Big collections of articles :

Dr. Dobb's Go Parallel Features
Computer Laboratory -
AMD Developer Central - Articles & Whitepapers
Sun Scalable Synchronization Research Group
Sun Labs Scalable Synchronization Research Group Publications

Worker pools and queues :

Multithreaded Producer-Consumer The Easy Way
Larry Osterman's WebLog So you need a worker thread pool...
Re Non-blocking queue for the 2 threads
Joe Duffy's WSQ 1
Joe Duffy's WSQ 2
Joe Duffy's WSQ 3
Creating a thread safe producer consumer queue in C++ without using locks - Cluebat-man to the rescue

Some shit about singletons and such :

Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004
Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN
Double-checked locking - Wikipedia, the free encyclopedia
The Old New Thing C++ scoped static initialization is not thread-safe, on purpose!
The Old New Thing An auto-reset event is just a stupid semaphore

Major lock-free code bases :

Paul E. McKenney Read-Copy Update (RCU)
Atomic Ptr Plus Project
TBB Home
Download details Parallel Extensions to .NET Framework 3.5 June 2008 CTP
Getting Started with the .NET Task Parallel Library Multi-Core Case Studies
Intel® Performance Libraries - Intel® Software Network
Intel® Thread Checker 3.1 for Windows - Intel® Software Network
Intel® Thread Profiler 3.1 for Windows - Intel® Software Network
Synchronization Algorithm Verificator for C++0x - Intel® Software Network
transforming single-threaded allocators... - Intel® Software Network

Some specific code samples (WARNING : OFTEN WRONG !! I have yet to find any reliable source of lock-free code on the net that you can just grab and use)

opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)
Kernel space Ticket spinlocks - LinuxWorld
CodeProject A simple Win32 readers/writers lock with reentrance. Free source code and programming help
CodeProject Single Producer Single Consumer, with no critical sections. Free source code and programming help
Lock-Free Linked List
SourceForge.net Spin-X Engine
Lock free double ended queue .mischief.mayhem.soap.
Single producer/single consumer bounded queue .mischief.mayhem.soap.
Lock-free, multi-threaded Mandelbrot .mischief.mayhem.soap.

ADDENDUM : Found some more goodies : Java Memory Model guide for compiler writers by Doug Lea. This is a very good overall discussion of memory models and how to implement them on various architectures.

Example implementation of C++0x on Power architecture by Paul McKenney.

fences in C++0x by Peter Dimov.

Joe Duffy on why critsecs need fences

Bartosz on the rare need for fences on x86


Part 2 : V'jukov and Thomasson

Now we get into the part that's actually useful. (note there are places where they "sketch" algorithms - obviously don't grab a sketched code snippet, stick it in your code, and assume it works)

Any lock-free code that has not been tested with Relacy is basically worthless. Anybody can write lock-free code and it "seems to work" - it's shit, it's garbage, it's no help at all. The hard part is in the testing and absolutely being sure it's correct, just sketching up something that "looks right" can be done by any monkey.

Relacy Race Detector Google Groups

Dmitriy V'jukov who wrote Relacy also has lots of great posts around the net. The most valuable information we will find is by him and one other guy :

Chris Thomasson, author of AppCore : the code in AppCore is mostly ASM and x86 specific. Chris has lots of great writing on lock-free stuff around the net, some more links to him will follow. He believes that doing a lot of this in ASM is easier because you can see the exact ordering of ops and don't have to worry about what kind of madness the compiler is doing. I think he's right to some extent, I wind up spending half my time trying to ensure the compiler generates the right single instruction. For example his SPSC queue is incredibly simple and efficient.

AppCore A Portable High-Performance Thread Synchronization Library
vzoom refcount - mostly lock-free word-based atomic reference counting

Our two heroes often reside here :

Scalable Synchronization Algorithms Google Groups

Discussion with Thomasson where he describes how the Fober LIFO is broken and how to fix it : (also contains his x86 ASM for a LF LIFO and small mention of the fact that stack can touch freed memory, but in a benign way, so that just catching the access violation and making it a retry is a good fix) :

Solving the memory issue of lock-free stack

Thomasson sketch of a work-requesting system : (unlike work-stealing, work-requesting lets you use a non-thread-safe stack for your own work items) :

VERY crude work-requesting environment... - comp.programming.threads Google Groups

Thomasson and V'jukov discuss the SPMC WSDQ of Hendler & Lev :

ABP stealing deque algorithm - comp.programming.threads Google Groups

Thomasson sketches a LF SPMC WSDQ (ignore all the junk posts at the top and scroll down to Thomasson and V'jukov) :

Single producer, one-shot work queue implementation, no lock on queue. - comp.programming.threads Google Groups

In this post Thomasson sketches the method of using an MPMC stack and just exchanging the head with NULL to grab the whole thing, then reversing it to make it FIFO instead of LIFO. This is a very efficient and simple way to do an MPSC FIFO. Also more interesting discussion by V'jukov at the bottom about garbage collection and such :

Wait free queue - comp.programming.threads Google Groups
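The stack-swap trick looks roughly like this (my own simplified sketch; Thomasson's posts have the real thing) - producers push onto a lock-free LIFO, and the single consumer exchanges the head with NULL to grab the whole list at once, then reverses it into FIFO order :

```cpp
#include <atomic>

struct Node
{
    int value;
    Node * next;
};

std::atomic<Node*> g_stackHead(nullptr);

// any thread may push : standard lock-free LIFO push
void mpsc_push(Node * n)
{
    Node * head = g_stackHead.load(std::memory_order_relaxed);
    do
    {
        n->next = head;
    } while ( ! g_stackHead.compare_exchange_weak(head, n,
                  std::memory_order_release, std::memory_order_relaxed) );
}

// single consumer only : grab the whole stack with one exchange,
// then reverse the list to get FIFO order
Node * mpsc_grab_all_fifo()
{
    Node * lifo = g_stackHead.exchange(nullptr, std::memory_order_acquire);
    Node * fifo = nullptr;
    while ( lifo )
    {
        Node * next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}
```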

A related thread - this starts off with the MPSC method using stack-swap, but then V'jukov drops a gem of a new MPSC method :

A Correct, Lock-free FIFO with a single CAS per complete put/get cycle - Scalable Synchronization Algorithms Google Groups
V'jukov intrusive MPSC node FIFO here

Another WSDQ where Thomasson and V'jukov try to straighten things out :

Double ended queue - review - comp.programming.threads Google Groups

Again discussion starts off with mystification about the x86 details, and leads to Thomasson posting his SPSC and straightening out the clowns :

Memory barrier + volatile (C++) - comp.programming.threads Google Groups

V'jukov's very cool node allocator for FIFO queues that does no atomic ops at all ! (amortized). FIFO's are a very bad case for standard multi-threaded allocators like tcmalloc, because every single block is being allocated on one thread and freed on another, which requires them to do a cross-thread pass-back (or causes heavy slab contention).

Lock-free node allocator for fifo-queues - Scalable Synchronization Algorithms Google Groups

V'jukov's simple example of an SPSC queue :

Single-Producer/Single-Consumer Queue - Intel® Software Network

V'jukov talks about CAS performance and issues of MPMC and other structures, mentions SList exception catching technique : (go to page 2)

Concurrent stack - Intel® Software Network

Finally : Thomasson's slab allocator

Dmitriy sometimes uses the opposite meaning of "head" and "tail" from how I think of them, and it tilts the hell out of me. I always think of "head" as where you push, and you pop from the tail. So stacks only have a head - you push & pop the head; FIFOs enqueue on the head and dequeue from the tail. He usually does the opposite. That is, in a singly-linked list I think of links as pointing from the head towards the tail, so if you walk the links you go from the head to the tail.

ADDENDUM : Found some more goodies :

Thomasson describes vZOOM architecture

V'jukov Memory Passing Queue good for thread caching allocators passing blocks back to the main thread for example, but also other things.

2/22/2009

02-22-09 - Linkies

If you're not already following Drew and Ryan's Void Star Blog, get over there and check out the awesome new video and Drew's nice render layer writeup.

Maciej posted this but it's so fucking great it's worth reposting : Gustavo Duarte writes about hardware and kernels with awesome diagrams.

Game Tunnel does a Monthly Round-Up of Indie Games that's not bad.

Peter Lindstrom - compression for graphics
Pedro V. Sander - parameterization, simplification, meshes, etc.
Very good.

Ishani.org - 2d level editor tools
This is a free 2d tile map editor for games (source code). I found it cuz I was searching for "Alien Breed" cuz the Void Star space game kind of vaguely reminds me of it.

compression.ru lossless image papers

bomb visual music
"Bomb" is one of the first interactive visual composition "instruments".

The Durrr Challenge is on. It's still early so not terribly exciting, but you can follow it here :
FT - durr challenge
Durrrr Challenge

2/19/2009

02-19-09 - Thread Safety Levels

I am trying to mark up my code with explicit comments about 3 levels of thread safety. I think this is a good concept that I haven't really seen discussed much :

Completely Thread-Safe (CTS) :

This function touches only local variables, objects through locks, objects which are intentionally & safely lock-free, TLS variables, and other stuff that is totally thread safe.

Object Thread-Safe (OTS) :

These functions can touch anything that is CTS, and also can touch any objects passed into them. If the objects passed into them are completely owned by the caller, then they are CTS. Class member functions should typically be OTS for example.

Not Thread-Safe (NTS) :

These functions touch globals or something and are just not thread safe. You must ensure they are run in sequential order.

So, for example, most of my init & shutdown code is NTS. I assume that you do inits, then start threads, then kill threads, then do shutdowns.

It would be really awesome if I could mark up functions in C++ with extra constraints. Then I could stick "CTS" on a function decl and the compiler could tell me if it does anything that doesn't comply with the CTS constraint. OTS can call CTS. NTS can call anything. If CTS calls NTS, it's a compile error.

Another thing that would sure be handy is a way to find all statics and globals. Fucking C++ has reused the reserved words so much, there's no word at all for globals, and "static" is used for so many things it's not a very useful search.

BTW another level that might be needed is Single-Thread-Safe :

These functions are thread-safe only if they are always called from the same thread. That does not necessarily mean that they are only touching data which that thread exclusively owns - they may be touching shared data, but in a careful way that works only if only one thread is writing the data. One obvious example is the lock-free single-producer data structures. They are not really CTS because if you call Push on them from any thread but the owner you are borked.

02-19-09 - PNG Sucks

If you're a game developer and you want to ship your game with compressed lossless images (which you should want - it makes the install faster and smaller, and you take less space on disk, and your game loads faster because decompression is faster than disk IO - it's win win win) - PNG is a natural choice. The good thing about PNG is it's widely supported now, and it's relatively simple (though still ungodly complex for what it does). The bad thing is it really sucks for compression in silly ways.

One problem is that the actual LZ is Zip. Zip is really bad old technology. I know there were reasons they used it, blah blah. Anyway. You are probably already compressing your game content with some kind of LZ (you should be). Maybe CAB or LZMA. Both of those knock the living piss out of Zip. But if you Zip the image first (via PNG) then CAB or LZMA can't work on it. Instead, just leave your image alone - leave it in BMP - and let the CAB/LZMA/whatever do its LZ !

The only thing missing is the "DPCM" (silly name for doing delta from neighbors). Again PNG has some pretty awful choices for DPCM filters.

We could make a much better DPCM using a larger neighborhood, but for speed and simplicity we will only consider the 1-ring (like PNG), that is N, W, NW :


NW  N
W   ?

There are only two logical symmetric values that we can make from our neighborhood :

grad = N + W - NW;
avg  = (N + W)/2

"grad" is the gradient predictor, the simple planar fit. The set of all symmetric linear predictors can thus be specified by one parameter :

pred(t) = avg * t + grad * (1-t)

PNG provides pred(t) for t = 0 and 1, but 0.25, 0.5 and 0.75 are all very good values too (and can be implemented with shifts), and in fact they beat 0 and 1 very handily on most images.
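For concreteness, here's what the predictors look like in code (my own sketch) ; pred(1/2) is just the average of grad and avg, all adds and shifts :

```cpp
// the two symmetric predictors from the 1-ring neighborhood :
int pred_grad(int N, int W, int NW) { return N + W - NW; }
int pred_avg (int N, int W)         { return (N + W) >> 1; }

// pred(t) = avg*t + grad*(1-t) ; for t = 1/2 it is just the average
// of the two, still all shifts (samples are nonnegative) :
int pred_half(int N, int W, int NW)
{
    return ( pred_avg(N,W) + pred_grad(N,W,NW) ) >> 1;
}
```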

Note that "grad" is actually a perfect predictor for horizontal and vertical stripes. That is :


A A
B ?

predicts B

B A
B ?

predicts A

The only time you would actually want a "N" or "W" pure predictor is in a weird case where pixels were correlated in one direction but not the other. For example if you had an image made from interlacing many images such that each horizontal line came from a different source image, then the "W" predictor would win. I have yet to find a single image where pred(t) is not best or very close to best.

Note that you could also obviously do an adaptive DPCM predictor in the style of CALIC that looks at the neighborhood edges and gradients and ranges and chooses different predictors. But that's an awful lot of complexity and would make it hard to put our simple DPCM into MMX or whatever.

So, this is what I'm roughly putting in Oodle for lossless texture compression. I just run a DPCM on the texture samples to turn them into deltas, and then I just jam the texture into my file stream, which LZ compresses everything. It's very easy to do, it's very fast, and it handily beats PNG.

ASIDE : this is only tangentially related, but the whole idea of the "planar" gradient predictor is very useful for predicting and extrapolating images. I used something similar in Galaxy3 to do normal map extrapolation (more about this in the next post). Obviously you can go past 1st-derivative prediction and do 2nd-derivative quadratic gradient prediction. I recently discovered a really sweet paper by Peter Lindstrom on Spectral Predictors which tabulates the best continuous predictor for various support shapes. (in image compression you always have the simple L-shape support because you are scanning in that order, but in other applications like normal map extrapolation you can have any possible support shape).

ADDENDUM : I should note there are like a million ways to beat this. That's not the point. The point is that it's *so* easy to do this, and it's very fast, so why not. Some obvious ways to beat it :

Divide image into 32x32 tiles and pick the best DPCM on each tile. To pick a DPCM do the least-squares best fit to find the optimal linear predictor. But least-squares is minimizing MSE which is not what you want, you want the minimum-rate predictor, so use MRP. Use more than a 1-ring neighborhood. etc. etc. (BTW none of this really changes the complexity of the decoder much, but you are definitely getting into the long compression tail of diminishing returns).

02-19-09 - Two Code Gems - Appendium

BTW similar to the array size thing is the Zero'er macro. It's very common to write a macro that will zero a struct out, like :

#define ZERO(ptr)  memset(ptr,0,sizeof(*ptr))

BITMAPINFOHEADER bih;
ZERO(&bih);

However, this ZERO() macro is very easy to use in catastrophically bad ways. (and no I'm not talking about invalidating C++ objects or anything - I'm assuming that you are using it only on things that *can* be zeroed, you just accidentally use it wrong).

Exercise : what happens in this code ?


DWORD test1[4] = { 1,1,1,1 };
char * test2[4] = { "","","","" };
char * test3 = "";

ZERO(test1);
ZERO(test2);
ZERO(test3);

(don't use a compiler to cheat).

And how do I write a better ZERO macro ? (I don't actually know the ideal answer to this question, so it's not entirely condescending rhetorical douchebaggery).
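For what it's worth, one possible improvement (my own sketch, not a definitive answer) is to make it a template that takes a reference, so sizeof always measures the whole object and the pointer cases stop being silent surprises :

```cpp
#include <string.h>

// sketch of a safer zeroer : taking the object by reference means
// sizeof(T) is always the size of the whole object. Arrays zero the
// whole array; a pointer argument zeroes the pointer itself, so you
// must explicitly write ZeroObj(*ptr) to zero the pointee.
// (only for memset-able types, of course)
template <typename T>
void ZeroObj(T & obj)
{
    memset(&obj, 0, sizeof(T));
}
```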

02-19-09 - Of Code and Old Boss Stories

I think every good programmer is defensive about their code. It's like a good chef or any good artist, if you find someone who doesn't care and doesn't stick up for themselves and doesn't get hurt by criticism - then they probably just suck. Of course this is also a dangerous and negative trait to some extent, you don't want to create walls around your code, you can be made much better with some outside help and pressure. (I've worked with people who would literally yell at you if you touched their code at all, and I've also worked with people who *intentionally* obfuscated the hell out of their code so that nobody else would want to touch it.)

I think I'm slightly more defensive than average. I'm happy to let people dig in my code, but it makes me let my attachment to it go. I guess I'm the same way about cooking or anything, like if I'm doing it the way I love to do it, and somebody else wants to do it some other way, okay, fine, do it your way, that's cool, but I no longer really care about this food that we're making and I'm not going to be really passionately involved anymore. I detach.

I almost have a love affair with code that I have written for my own devices. Code that is uncluttered by all the ugly practicalities of error checking and multiple platforms and compilers; code that is clean and pure and simple and elegant. It makes me happy just to think about certain snippets, and sometimes I like to just go and "caress" them by changing a variable name or slightly cleaning up things, it's just a little visit to let the code know I still care. If someone else starts digging their dirty hands in my beautiful code, I feel like she has betrayed me. You slutty code, fooling around with other programmers, fine, fuck you, I'm leaving you, I no longer care about you. I might still use you but I will no longer tend and caress you, and over time you will get hacky and bloated and buggy and crufty because many hands are in you and no one is refactoring you.

I was thinking this morning about two of the horror story bosses that I had long ago. They shall remain anonymous to protect their privacy. At the time they drove me fucking batty insane, and I was too young and green to just quit or yell at them, but now I find it very amusing actually.

The first one was at this little company in one of my first jobs. The boss man was actually super nice, I liked him, but he was older and not really doing much coding any more, which was fine. The problem was he liked to hang out at the office really late at night and drink beers in his office and talk to people on the phone and watch movies and such. And he would go dig around the source code during these late night drunken office sessions. Often I would come in in the morning and find a bunch of changes in the code that were just like "WTF !?". It really drove me nuts until I figured out what was going on, and then after that I knew just to do a diff every morning when I got in and revert most of the changes that happened in the night. (I was actually a pretty terrible coder then and was writing tons of bugs, so some of his changes actually were fixes to my work).

The second was probably worse. You've surely had bosses who are "code style nazis" ; well I had a boss who was literally on the IEEE committee for code formatting style standards. When I started working he gave me like a 20 page document on how to format your code correctly, the number of tabs that should go here, how this should be aligned to that, etc. I was like "umm, okay" , skimmed it and threw it in the bin. Little did I know he expected exact compliance to every single rule on every line of code. He would review every check in that I made and send me lists of formatting errors. This was not even like code style safety issues or "defensive programming" or anything like that, it was entirely about variable naming conventions, number of tabs, spaces, etc. I quit after 3 months.

02-19-09 - Two Code Gems

First one is from Mischief/Mayhem (RDE) :

I, like many, have long used this array count :


#define ARRAY_SIZE(data)    (sizeof(data)/sizeof(data[0]))

However, it's very error prone. You can get errors just by accidentally using it wrong, but you also get "silent refactor errors" which is one of the things I really really hate. When I go and change a variable in a way that changes its function, I want all the code that relies on it to scream at me. Basically I want to be able to touch parts of the code without scanning every single line that uses that variable. For example you can use it like this :

    DWORD test1[4];
    char * test2[4];
    char * test3;
        
    COMPILER_ASSERT( ARRAY_SIZE(test1) == 4); // okay
        
    COMPILER_ASSERT( ARRAY_SIZE(&test1) == 4); // wtf
    COMPILER_ASSERT( ARRAY_SIZE(*test2) == 4); // wtf
    COMPILER_ASSERT( ARRAY_SIZE(test3) == 4); // wtf

Maciej has a cool template trick to make the ARRAY_SIZE macro more constrained so that it only works when you want it to work :


template < typename T, size_t N > char (&ArrayCountObj(const T (&)[N]))[N];
#define RDE_ARRAY_COUNT(arr)    (sizeof(rde::ArrayCountObj(arr)))

This is a very common template metaprogramming trick : the use of *types* as *values*. In order to basically make a function that returns N, you create a type that contains the value N (in this case a char array of size N). Then to turn that into a number, you use sizeof(), which is a device that can turn types into ints at compile time. Nice! C++ is a magic box that lets you make C the way it should be. (sometimes).

BTW on a semi-related coding style issue, say you have some initialized array and a count, like :


// expose in header :
extern const int c_numStrings;
extern const char * c_myStrings[];

Don't do this :

const int c_numStrings = 4;
const char * c_myStrings[4] = { "a", "b", "c", "d" };

Instead do this :

const char * c_myStrings[] = { "a", "b", "c", "d" };
const int c_numStrings = ARRAY_SIZE(c_myStrings);

or :

const int c_numStrings = 4;
const char * c_myStrings[] = { "a", "b", "c", "d" };
COMPILER_ASSERT( c_numStrings == ARRAY_SIZE(c_myStrings) );

To ensure that you actually had the right number of initializers. This ensures that you set up the array right, and also that you keep the array in sync with whatever count it is supposed to correspond to.


(addendum : you can skip reading this whole thing and just go to Wikipedia which has made me her bitch)

Next one is random permutation. The right way to generate a permutation with all possibilities equally likely is :


for(int i=0;i < num;i++) array[i] = i;

for(int i=0;i < num;i++)
{
    int j = i + rand_count(num - i);
    swap( array[i], array[j] );
}

It's pretty easy to see this is correct because at each i you are drawing a number at random from the remaining possibilities, just like you would if you were drawing a number from a hat. (the last iteration is always a NOP so you could iterate up to just num-1).

That's important but well known (thanks Sean), but let's look at rand_count. rand_count(size) should return a random number in [0,size-1].

I'm going to assume you have a decent base random number generator called rand() , but it should not be the CLIB rand (may I suggest KISS99). Many people write rand_count like :


int rand_count(int size)
{
    int r = rand();
    return r % size;
}

Which sort of works, but there are some problems with this. First of all, with a lot of random number generators, the low bits are the worst. You could shuffle the bits to try to fix that. Second of all, modulo is really really slow (it's a divide). But perhaps worst of all - this is only an even distribution in [0,size-1] when size is very small !!

When size is close to the range of rand (15 bits in clib), it becomes very biased towards low numbers. For example if size is 20,000, rand() has 32768 possible values, and 12768 of them wrap around, so the first 12768 outputs are twice as likely as the remaining 7232 !!
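Since there are only 32768 possible rand() values, you can verify that bias exhaustively (the function name here is just for illustration) :

```cpp
#include <vector>

// push every possible 15-bit rand() value through "r % size" and
// histogram the outputs : with size = 20000, the first 12768 buckets
// get two hits each, the remaining 7232 get one.
std::vector<int> modulo_histogram(int size)
{
    std::vector<int> hits(size, 0);
    for (int r = 0; r < 32768; r++)
        hits[r % size] += 1;
    return hits;
}
```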

Of course you could also do it by multiplication. Say your rand() made a 15-bit rand like clib does, then you can do :


int rand_count(int size)
{
    int r = rand();
    return (r * size) >> 15;
}

And that works, but has a nasty problem in that the multiplication can overflow if size is too big. (note that in that case the previous rand_count would be equally bad and just fail to ever return a big value). In many applications it would suck to have to check that size is too big, like say I'm testing something and want to grab a random byte from a file, I want to handle up to 2 GB files and not have to worry about my rand overflowing.

So the obvious answer is to go 64-bit and that also leads to a sweet optimization. rand32() here returns a full 32-bit random unsigned int :


unsigned int rand_count(int size)
{
    unsigned int r = rand32();
    return (unsigned int) ( ( (uint64) r * size) >> 32 );
}

And MSVC actually does the right thing here and actually generates just a plain "mul" and "ret edx". That is, it grabs the high 32 bit part of a 32-bit multiply directly and returns it, there's no 64-bit multiplies and no shift at all. (even if it didn't optimize perfectly, I would still use this because it just always works and I rarely need rand to be super duper fast, it's more important that it's correct)

BTW the multiplication method is using the top bits of rand32() , so you need a rand32() that makes good top bits (or good *all* bits). In fact if size = 2 this is literally returning the top bit. (this is actually a good thing for most simple linear PRNG's because the top bits are very good while the bottom bits are very bad).

BTW I use rand_count defined like this a lot because it's what you want to draw from an array, but I've never found a good name for this function, so help me out. It's always a little confusing that it returns in [0,size-1], not [0,size].

ADDENDUM : Sketch of randMod using retries :


// randMod returns in [0, range-1]
uint32 randMod( uint32 range )
{
    ASSERT( range > 0 ); // infinite loop

    // make a mask : the next pow2 above range, minus one
    uint32 mask = range | (range>>1) | (range>>2) | (range>>3);
    mask = mask | (mask>>4);
    mask = mask | (mask>>8);
    mask = mask | (mask>>16);
    // bsr would be faster

    for(;;)
    {
        uint32 rand32 = myrand32();
        uint32 randInMask = rand32 & mask;
        ASSERT( randInMask < range*2 );
        // >= 50% chance of passing the test, so iterations should be rare
        if ( randInMask < range )
            return randInMask;
    }
}

2/18/2009

02-18-09 - Perfectionists are Good Bosses

I pulled Jeff in to show off my awesome thread work dispatcher that I was all proud of. I pull up my threadprofile viewer that shows the concurrency, and he immediately goes "what is this time gap at the end where you don't show anything happening?". I was like "hmm god dammit I'm trying to show off my cool shit here, don't point out the flaws". But today I realized that on the main thread I was doing a sleep-loop to wait for the work to all get done, when really I should be using an event to wake the main thread when the workers run out of work. With a sleep-loop you can waste time up to the length of your sleep interval. With a wait event you wake instantly when it's time to wake.

I already knew about the evils of sleep-loops and the goodness of events for thread waiting, I just got lazy in this particular loop and did it the sleep way cuz it's easier. Perfectionist bosses can be annoying when you feel like you did something well enough, but they're also damn helpful, and they push you to be honest about the quality of your work. It's the role I used to play at OW, and I missed having it when I was working on my own. Anyway I fixed it and the time gap disappeared and now I am happy. (this could also be a blog post about the awesomeness of good tools - I never would have known about the time gap without the threadprofiler and viewer).

BTW about windows events : something that has been hammered home for me recently is that Windows manual reset events are just a race condition waiting to happen. !! Always use auto-reset events !!

For example, if the worker thread is sending you signals once in a while to let you know certain things have happened, if you have code like this in the main thread with a manual reset event :


WaitForSingleObject( hEvent, INFINITE );
  (*) race here
ResetEvent( hEvent );

That's a race. If your thread is swapped out after the Wait but before the reset, other threads can run along and signal the event. Then it switches back to you, you reset the event, and you lose any signals that were set.

I have often used manual reset events successfully in the past, because I use them to signal that a shared data item has changed, and the item itself is protected by a critical section. As long as you take the critical section and use the shared data to reevaluate whether to wait, that pattern is safe from races even with a manual reset event (it doesn't matter that you may miss some event signals because you will see the change when you check the data). But now that I am doing more lock-free stuff, it's more important to get the event right, and hell, there's no reason not to just use auto-reset always, so do!! (when in doubt, be safer).

And now a teaser/brag. This is the performance of the work dispatcher :


linear        : 0.770 seconds, 2312158485 clocks   100.000% of linear time
1 threads     : 0.770 seconds, 2313504188 clocks   100.058%
2 threads     : 0.386 seconds, 1160138775 clocks    50.175% , speedup = 1.993 = 0.99650 per core
3 threads     : 0.258 seconds, 774830107 clocks     33.511% , speedup = 2.984 = 0.99467
4 threads     : 0.194 seconds, 582231143 clocks     25.181% , speedup = 3.971 = 0.99275
5 threads     : 0.194 seconds, 582824062 clocks     25.207%
6 threads     : 0.194 seconds, 583144837 clocks     25.221%
7 threads     : 0.194 seconds, 583397962 clocks     25.231%
8 threads     : 0.194 seconds, 582415282 clocks     25.189%

Making a work dispatcher with almost perfect scaling is not really that impressive (there are plenty of existing ones around the net - and in fact if you are Windows-only there are some good ones built directly into Windows), but there are some things that I am happy about :

The loss from going from single-thread linear to dispatched in packets is very close to zero.

The loss from running lots of small packets as opposed to a few big ones is very close to zero. (I tried packets of between 4 and 64 items and it made almost no difference).

There's no slowdown when you have too many threads (obviously this is a 4 core machine), which is a common flaw in broken work dispatchers (they start getting slower and slower with too many threads because they have contention issues and they fight each other).

(I should say this testing is on a quad-core chip, which doesn't show scalability problems nearly as much as a true 4-proc machine with separate caches, so there may still be scalability flaws I don't know about)

2/16/2009

02-16-09 - Low Level Threading Junk Part 2.5

Back to threading. See the old posts to catch up. We talked a bit about memory models and singletons and such in the last section. One thing we haven't really talked about is the value of TLS (Thread Local Storage). In our singleton discussion we mentioned that the C++ "static" is a huge dangerous trap that should be avoided. In fact, all statics and globals are dangerous and must either be eliminated or thread protected.

One way to fix statics/globals nicely is to make them thread-local. In fact, I would argue that statics/globals should be thread local by default, and only with great care should you make them not thread local. Basically access to TLS variables is totally fast and safe and you don't need to worry about threading issues as long as you only work on TLS values. It's a good pattern to have minimal thread communication areas, and then copy what you need into your TLS. (of course there are other ways to do "thread local" data other than TLS, such as just by convention - like you know that a certain global is only touched by one thread, but that requires care and commenting).

The TLS is a block of memory associated with each thread that you can index into to get at variables; it's analogous to the statics section of a normal exe. Now there are functions to manually get access to the TLS but there's basically no reason to use those, and I don't recommend it. Every compiler on every platform now provides something like "__declspec(thread)" which declares a variable to go in TLS. You just define it like :


__thread int thisVarIsInTLS;             // gcc/clang spelling
__declspec(thread) int alsoInTLS;        // MSVC spelling

If you want to know a lot about the details of how TLS works, I found a great set of articles by Nynaeve all about how TLS is implemented in Windows :

Nynaeve » Blog Archive » Thread Local Storage, part 1 Overview
Nynaeve » Blog Archive » Thread Local Storage, part 2 Explicit TLS
Nynaeve » Blog Archive » Thread Local Storage, part 3 Compiler and linker support for implicit TLS
Nynaeve » Blog Archive » Thread Local Storage, part 4 Accessing __declspec(thread) data
Nynaeve » Blog Archive » Thread Local Storage, part 5 Loader support for __declspec(thread) variables (process initialization
Nynaeve » Blog Archive » Thread Local Storage, part 6 Design problems with the Windows Server 2003 (and earlier) approach to
Nynaeve » Blog Archive » Thread Local Storage, part 7 Windows Vista support for __declspec(thread) in demand loaded DLLs
Nynaeve » Blog Archive » Thread Local Storage, part 8 Wrap-up

One important thing to note : MSVC supports CINIT for TLS. That means you can use real C++ classes in TLS, and what happens is each time a thread is created, it runs the "cinit" list for those objects. This is *not* widely supported by other compilers, so if you want to write code that is widely portable you should use only basic C types in TLS. It's a bit of a shame because having the cinit and shutdown for TLS would be handy. Instead you have to make your own "shim" or "thunk" routine around your thread to init and shutdown objects.

So far as I can tell, every major compiler on every major platform supports something like "__declspec(thread)" or "__thread". There are some nice advantages to doing this over using manual TLS allocation and indexing. For one thing the compiler can write much better code to access it, since the offsets are constants rather than variables, and it can make use of the direct pointer to the TLS struct. More importantly, you don't need to worry about the initial setup of the TLS indices in a thread-safe single-instantiation way (which we saw in earlier articles is a pain).

Next post will be on some details of lock-free algorithms.

BTW while I'm posting big blog series, I gathered Larry Osterman's series on Concurrency. This series is not great - there are some errors - but it is pretty comprehensive, so :

Concurrency, Part 1
Concurrency, Part 2 - Avoiding the problem
Concurrency, Part 3 - What if you can't avoid the issue
Concurrency, Part 4 - Deadlocks
Concurrency, part 5 - What happens when you can't avoid lock misordering
Concurrency, part 6 - Reference counting is hard
Concurrency, part 7 - Why would you ever want to use concurrency in your application
Concurrency, Part 8 - Concurrency for scalability
Concurrency, Part 9 - APIs that enable scalable programming
Concurrency, Part 10 - How do you know if you've got a scalability issue
Concurrency, Part 11 - Hidden scalability issues
Concurrency, part 12 - Hidden scalability issues, part 2
Concurrency, Part 13 - Concurrency and the CLR
Concurrency, Part 14 - Odds and Ends
Concurrency, Part 15 - Wrapping it all up.
Concurrency, part way too many 1 - concurrency and the C runtime library.
Concurrency, part way too many 2 - Concurrency and the Win32 API set
Concurrency, part way too many 3 - Concurrency and Windows

02-16-09 - Amusing Word Problem

Found this and it confused me for a second :

in a country in which people only want boys, every family continues to have children until they have a boy. if they have a girl, they have another child. if they have a boy, they stop. what is the proportion of boys to girls in the country?

Please settle this argument for me. Will it be 50/50?

(assume the a priori chance of boy or girl is 50/50 , and each family stops at its first boy).
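One way to settle it : simulate. A hedged Monte Carlo sketch (function name made up, fixed seed for reproducibility) - every family has children until the first boy, then we compare totals :

```cpp
#include <random>

// returns total boys / total girls over numFamilies simulated families
double SimulateBoyGirlRatio(int numFamilies, unsigned int seed)
{
    std::mt19937 rng(seed);
    std::bernoulli_distribution isBoy(0.5);
    long long boys = 0, girls = 0;
    for (int family = 0; family < numFamilies; family++)
    {
        while (!isBoy(rng))
            girls++;      // another girl, family keeps trying
        boys++;           // first boy : family stops
    }
    return (double)boys / (double)girls;
}
```

Each family contributes exactly one boy, and a geometrically-distributed number of girls whose expectation is also one, so the simulated ratio hovers right around 1.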

old rants