cbloom rants: 03-02-09 - Low Level Threading

Wrap up and Relacy

I know this stuff is scary, but I believe it's necessary and it's time to embrace it. In the coming years, we will all be writing multi-threaded code. Most of you are probably writing lock free code already. I believe that using solid lock free data structures is the right way to do this. It lets you write them carefully, understand them, profile them, test them, simulate them in Relacy, and then you can be very confident about them, and the client code that uses them is pretty simple and doesn't have to worry too much about weird threading issues.

Now, if you like, you can use message passing between threads with locking. In fact, I encourage you to leave that as an option. In all my code I have #defines to swap the lock-free queue for just a plain old critsec locking queue. Furthermore, I have #defines to remove threading altogether (and just pump the thread routine periodically from the main thread). This lets me debug problems without threading when possible. If the locking queue overhead is fine for you, then go ahead and use it.
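Here's a rough sketch of what that kind of switch can look like. This is not my actual code: the names USE_LOCKFREE_QUEUE, LockingQueue, LockFreeQueue, MessageQueue and pump_worker are all made up for illustration, and std::mutex stands in for a critsec.

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical build switch, named for illustration only:
// #define USE_LOCKFREE_QUEUE 1   // would select a lock-free implementation

// Plain locking queue: the simple fallback, one mutex around a deque.
template <typename T>
class LockingQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(std::move(value));
    }
    // Returns false (and leaves 'out' untouched) when the queue is empty.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<T> items_;
};

#if defined(USE_LOCKFREE_QUEUE)
// Assumed: a LockFreeQueue<T> with the same push/try_pop interface exists.
template <typename T> using MessageQueue = LockFreeQueue<T>;
#else
template <typename T> using MessageQueue = LockingQueue<T>;
#endif

// The worker routine is an ordinary function, so it can run in a thread loop
// or be pumped periodically from the main thread when threading is disabled.
// Here it drains all pending messages and returns their sum.
int pump_worker(MessageQueue<int>& inbox) {
    int sum = 0;
    int msg = 0;
    while (inbox.try_pop(msg)) sum += msg;
    return sum;
}
```

The client code only ever sees MessageQueue and pump_worker, so flipping the #define changes the implementation without touching any call sites.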

People who don't like the lock-free data structures usually wind up using one of two alternatives : 1. "ad-hoc" multithreading, or 2. "parallel for". I believe both of these are wrong in different ways.

"ad-hoc" multi-threading is in widespread use currently in game development. This involves a whole host of dangerous techniques, such as just setting variables in one thread and checking them in another, or randomly tossing in some Interlocked ops to make things "safe". This stuff scares the shit out of me, is not generally well understood or isolated, and can cause really evil bugs. I strongly encourage you all to stop doing this. You are highly likely to be writing rare race conditions when you do this, that will be very hard to find or reproduce. Any time you communicate between threads without using a critsec you are writing lock-free code. Just because you didn't use a lock-free FIFO or something doesn't mean you made it simpler.

"parallel for" is being pushed by a lot of people as a solution to multi-threading. Don't get me wrong, I think "parallel for" is a perfectly good thing - however, it is a very high level abstraction that is not capable of addressing the hard threading problems. It basically only has one use - taking some large task that is very easy to seperate into work pieces, doing them on threads, then waiting for them all to be done. That is a common pattern, especially in tools, but it is only one specific work pattern - and it is BY FAR THE EASIEST. I think it's a huge mistake to have these big software products (like .NET Parallel Extensions, Open MP, Intel TBB, etc.) all basically targetted at "parallel for" type work patterns - because that is a *leaf* usage pattern. Threading libraries should be targetted at the *core* usage pattern, and then you can build your own leaves on it.

Basically "parallel for" is just a syntactic sugar for wrapping up independent operations into "Worklets" that can be run on Workers in a thread pool job scheme. I don't need syntactic sugar. Converting a for loop into Worklets is not the hard part of threading. I'm perfectly happy to write a little code to turn my for loop into a function pointer that can be run from the worker swarm. If somebody gave you a bunch of lock-free queues and lock-free hash maps and cross-platform signals and events, you could code up a job swarm parallel-for thing in an afternoon. I'm serious. It's just not the hard part of the problem at all.

And it's not very useful in realtime apps. We have threads that run together over long periods of time and need to check on each other's status and coordinate latency and timing, etc.
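To show how little there is to the parallel-for pattern itself, here is a bare-bones sketch using std::thread and a shared atomic cursor. It is a simplification: it spawns threads per call instead of using a persistent worker pool with lock-free queues, and the names parallel_for and square_all are just for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs body(i) for every i in [0, count), spread over 'extra_workers' threads
// plus the calling thread, and returns once all of them are done.
void parallel_for(std::size_t count, unsigned extra_workers,
                  const std::function<void(std::size_t)>& body) {
    std::atomic<std::size_t> next{0};  // shared cursor: each fetch_add claims one index
    auto worker = [&] {
        for (;;) {
            std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= count) return;
            body(i);
        }
    };
    std::vector<std::thread> swarm;
    for (unsigned w = 0; w < extra_workers; ++w) swarm.emplace_back(worker);
    worker();                        // the caller helps instead of idling
    for (auto& t : swarm) t.join();  // join makes all results visible to the caller
}

// Example use: square each element; iterations are independent of each other.
std::vector<int> square_all(std::vector<int> values) {
    parallel_for(values.size(), 3,
                 [&](std::size_t i) { values[i] *= values[i]; });
    return values;
}
```

That really is the whole pattern: claim a piece, run it, wait for everyone. None of the long-lived coordination between threads shows up here.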

Okay, now let's talk about how you establish confidence in your threading routines. The only way is through exhaustive testing. And I don't mean some little unit test you code up. I mean Relacy Race Detector or an equivalent.

In all my lockfree code, I avoid directly doing things like spinning or even storing variables. I call functions like "LoadAcquire" or "StoreRelaxed" or "SpinBackOff". That way I can just change those for each platform and the code should stay the same. Another advantage is that I can change these functions to use C++0x or Relacy.
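A sketch of what such a wrapper layer might look like when mapped onto C++11 std::atomic. LoadAcquire, StoreRelaxed and SpinBackOff are the names mentioned above, but the signatures and bodies here are assumptions, not my shipping code; LoadRelaxed, StoreRelease and WaitForNonZero are added only to round out the example.

```cpp
#include <atomic>
#include <thread>

// Thin wrappers: the algorithm code calls these, and only these bodies
// change per platform (or get redirected to a checker's types).
template <typename T>
T LoadAcquire(const std::atomic<T>& v) { return v.load(std::memory_order_acquire); }

template <typename T>
T LoadRelaxed(const std::atomic<T>& v) { return v.load(std::memory_order_relaxed); }

template <typename T, typename U>
void StoreRelease(std::atomic<T>& v, U x) { v.store(x, std::memory_order_release); }

template <typename T, typename U>
void StoreRelaxed(std::atomic<T>& v, U x) { v.store(x, std::memory_order_relaxed); }

// Assumed back-off policy: retry immediately a few times, then start yielding.
struct SpinBackOff {
    unsigned count = 0;
    void operator()() {
        if (count < 16) ++count;         // early tries: just go around again
        else std::this_thread::yield();  // later tries: give up the time slice
    }
};

// Example algorithm code written only against the wrappers.
int WaitForNonZero(const std::atomic<int>& v) {
    SpinBackOff backoff;
    for (;;) {
        int x = LoadAcquire(v);
        if (x != 0) return x;
        backoff();
    }
}
```

WaitForNonZero never names a memory order or a platform primitive directly, which is the point: the same source can be pointed at a different backend by swapping the wrapper bodies.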

Relacy Race Detector is a dynamic thread race checker. It's based on the C++0x model, so adapting the code to work in Relacy basically just means making it use the C++0x model, plus a bit of extra juju.

Relacy replaces the C++0x types like std::atomic with its own versions. The Relacy objects contain a lot of extra information, like what thread last read & wrote the value, and a history of what the value has been over the past. It can tell you things like reading uninitialized variables, data races, etc.

Relacy simulates the possible thread execution orders. It does this by testing thread switches at all the "sync points" in the code (sync points are places where you touch shared variables). Normal C code that you write is assumed to touch only thread-local stuff - it is run without checking. If you touch either a std::atomic or an rl::var (in my code these are ATOMIC_VAR or NOT_THREAD_SAFE) it creates a sync point. (Additional sync points are at any mutex entry/exit, at thread yield, and at memory fences.)

The sync points automatically invoke the Relacy scheduler which might do a thread switch. In Relacy you can easily see all the sync points because they have a "$" at them. (the $ is actually just a #define to debug file/line, but it's useful as a quick visual way to see all possible thread switches).

Relacy actually runs your code using fibers. It makes a fiber for each thread and then does the switching between fibers manually. That means that in a Relacy test, your code actually runs single threaded. Plain C code that is supposed to be thread-local is run atomically (with no thread switches) which is nice for debugging.

Relacy has two main simulation modes : "context_bound" and "random". Random is obviously a Monte Carlo simulation. At each sync point it randomly picks a thread switch. You want to run 10^8 or so iterations of random, and you should make the test pretty complex to try to stress the algorithms. "context_bound" is a full enumeration of all possible program execution orders (bounded by the context limit, which is some kind of tree depth limit). A test with context_bound can find very obscure races that you might never see in a million years of real world testing, such as switching back and forth between two threads on exactly alternating lines of code or whatever. Generally you want to run a very simple test with context_bound such as just doing one push and one pop.

Note that Relacy does not check that your code is actually producing the right output - you need to do that yourself using RL_ASSERT. What Relacy checks is that the code is 1. free of obvious bugs like races, memory leaks, etc., and 2. produces the same output every time regardless of thread switch pattern.

This #2 is obviously crucial - it's the whole goal of writing correct threaded code : the action is the same regardless of the execution thread switch pattern. Obviously to check this, your program must be deterministic. That means, no using rand (use rl::rand instead), no using statics (use test_suite::before() or test_suite::invariant() instead). (note that if you use rl::rand, all possible rand values will be simulated, so generally use a small count).
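The idea of enumerating every thread switch pattern and checking that the outcome never changes can be shown with a toy that has nothing to do with Relacy's real implementation. This sketch models two "threads" as sequences of steps on a shared counter, runs every possible interleaving, and collects the set of final values; all names here are made up for the illustration.

```cpp
#include <set>

struct State {
    int shared = 0;        // the shared counter
    int local[2] = {0, 0}; // per-thread temporary
    int pc[2] = {0, 0};    // next step index for each thread
};

using StepFn = void (*)(State&, int thread, int step);

// Depth-first walk over every interleaving: at each point, try letting each
// unfinished thread run its next step. Records the final counter value.
void explore(State s, StepFn fn, int steps_per_thread, std::set<int>& outcomes) {
    bool any_runnable = false;
    for (int t = 0; t < 2; ++t) {
        if (s.pc[t] < steps_per_thread) {
            State next = s;
            fn(next, t, next.pc[t]);
            ++next.pc[t];
            explore(next, fn, steps_per_thread, outcomes);
            any_runnable = true;
        }
    }
    if (!any_runnable) outcomes.insert(s.shared);
}

// Racy increment: load and store are separate steps, so a switch can land
// between them.
void racy_increment(State& s, int t, int step) {
    if (step == 0) s.local[t] = s.shared;  // load
    else s.shared = s.local[t] + 1;        // store
}

// Atomic increment: one indivisible step.
void atomic_increment(State& s, int, int) { ++s.shared; }

std::set<int> outcomes_of(StepFn fn, int steps_per_thread) {
    std::set<int> outcomes;
    explore(State{}, fn, steps_per_thread, outcomes);
    return outcomes;
}
```

outcomes_of(racy_increment, 2) contains both 1 and 2, so the result depends on the switch pattern and the code is broken; outcomes_of(atomic_increment, 1) contains only 2. A correct threaded routine is one where that set has exactly one element.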

Relacy also tracks the full execution state and can save and restore the state at any point. Basically it journals the entire thread switch history. I haven't actually used this yet, but it could be awesomely handy for finding really tricky threading bugs.

Personally I believe that lock free code that has not been thoroughly tested with Relacy is toxic and I want nothing to do with it. (as I've written previously, if you actually plug a lot of the respected published lock-free algorithms into Relacy you will find races) (BTW another warning : do not just grab the example code that comes with Relacy and try to use it in your project - some of Dmitriy's examples are intentionally broken so that you can see how Relacy finds errors).

Also, there are some other race checkers out there. CHESS by Microsoft Research is an okay one if you are on Windows. The Intel Thread Checker is similar. I believe they are both inferior to Relacy. There are also some that work by static analysis; the static analysis tools generally work by having you write up your program in a custom syntax. I haven't tried these but I don't like the idea of having to translate between a custom syntax, because it creates the possibility that the code as written in the custom tool is okay, but after translating into real code it's broken. The ideal thread checker IMO works directly on code in its real syntax so there is no translation danger, and is also comprehensive, not sampling.

You may now download the code for my lock free data structures in a Relacy wrapper (requires MSVC >= 7) :

myrelacy.zip (111k)

(this is based on Relacy 1_3_0 ; the official version of Relacy 1_3_0 has some bugs that are fixed in here; Dmitriy will be releasing an official fixed version soon)

ADDENDUM ON RELACY LICENSE : (revised 9-14-09)

Relacy is now released under the BSD license :


    /*  Relacy Race Detector
     *  Copyright (c) 2009, Dmitry S. Vyukov
     *  All rights reserved.
     *  Redistribution and use in source and binary forms, with or without modification,
     *  are permitted provided that the following conditions are met:
     *    - Redistributions of source code must retain the above copyright notice,
     *      this list of conditions and the following disclaimer.
     *    - Redistributions in binary form must reproduce the above copyright notice, this list of conditions
     *      and the following disclaimer in the documentation and/or other materials provided with the distribution.
     *    - The name of the owner may not be used to endorse or promote products derived from this software
     *      without specific prior written permission.
     *  THIS SOFTWARE IS PROVIDED BY THE OWNER "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
     *  THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
     *  IN NO EVENT SHALL THE OWNER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
     *  OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
     *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
     *  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
     *  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
     */

My work with Relacy is 100% free for any use. However, the original Relacy license still applies to all work product made with Relacy, such as my code above.

The version of Relacy that I built my code with was released under a previous license that was less clear and more restrictive. Dmitry says the new BSD license applies.


3/02/2009

03-02-09 - Low Level Threading - Part 6
