3/07/2009

03-07-09 - Malloc

Arseny Kapoulkine has a nice new blog post about replacing the CRT malloc.

I looked into it a while ago and gathered these links on the subject :

What your mother never told you about graphics development Fighting against CRT heap and winning
TRET The Reverse Engineering Toolkit
The Bag of Holding
somecode Memtracer
Practical Efficient Memory Management EntBlog
New Fun Blog - Scott Bilas Figuring out how to override malloc and new
MemTracer strikes back .mischief.mayhem.soap.
Lightning Engine - Blog Archive - Tracking Memory Allocations
Gamasutra - Monitoring Your PC's Memory Usage For Game Development
Fast File Loading EntBlog
Detours - Microsoft Research
CodeProject Visual Leak Detector - Enhanced Memory Leak Detection for Visual C++. Free source code and programming help
A Cross-Platform Memory Leak Detector

then I decided it was a pain in the ass and I wasn't going to do it. This is one place where C++ got it so much better - you can just override new/delete and it all works. I never use malloc/free anyway, so the few people who do use them can go to the CRT heap - what do I care?

The new cblib has a simple lock-free paged small block allocator. It's very primitive and not as fast or efficient as it could be (it's very fast for single threaded but has very poor cache line sharing behavior), but on the plus side it is nice and simple and self-contained unlike tcmalloc or hoard or something which are like a huge mess of code.
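To make the idea concrete, here is a minimal sketch of a paged small-block allocator. This is NOT the cblib version - it's a simplified, single-threaded stand-in (the class and member names are mine) that only shows the paging and intrusive free-list structure; the real one adds the lock-free part:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Simplified single-threaded sketch of a paged small-block allocator:
// each pool carves big pages into fixed-size blocks and chains frees
// onto an intrusive free list. (The cblib version is lock-free; this
// illustrates only the paging/free-list structure.)
class SmallBlockPool {
public:
    explicit SmallBlockPool(size_t blockSize, size_t blocksPerPage = 256)
        : m_blockSize(blockSize < sizeof(void*) ? sizeof(void*) : blockSize),
          m_blocksPerPage(blocksPerPage), m_freeList(nullptr) {}

    ~SmallBlockPool() { for (void* p : m_pages) free(p); }

    void* Alloc() {
        if (!m_freeList) AddPage();
        void* block = m_freeList;
        m_freeList = *static_cast<void**>(block); // pop head of free list
        return block;
    }

    void Free(void* block) {
        *static_cast<void**>(block) = m_freeList; // push onto free list
        m_freeList = block;
    }

private:
    void AddPage() {
        char* page = static_cast<char*>(malloc(m_blockSize * m_blocksPerPage));
        m_pages.push_back(page);
        // thread every block of the new page onto the free list
        for (size_t i = 0; i < m_blocksPerPage; ++i)
            Free(page + i * m_blockSize);
    }

    size_t m_blockSize, m_blocksPerPage;
    void* m_freeList;
    std::vector<void*> m_pages;
};
```

Note the LIFO free list is exactly what gives the good single-threaded speed and the poor cache-line-sharing behavior mentioned above: the most recently freed block (possibly hot in another thread's cache) is the next one handed out.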

BTW related to this all and the threading stuff is : LeapHeap

Hmm these were in that link bag and don't belong, but they are quite fun -
HAKMEM -- CONTENTS -- DRAFT, NOT YET PROOFED
Hacker's Delight

3/03/2009

03-03-09 - Oodle Update

Oodle currently has two modes - "Incremental Build" and "Final". In Final mode it's assumed that you have made all the packs the way you want, and all the content is valid. This is presumably how you will actually ship your game. In Final mode, I don't check bundles against the original files or any of that stuff, it's much simpler and faster.

The problem of course is that Final is a big pain during development. Incremental lets you change any individual file at any time. It lets you delete any of the packed bundles. It automatically tries to use the best packed bundle - but only if it contains the latest version of the file. As I've written about before, we can't use modtime for this reliably, so I now use the hash of the source file (and use modtime to tell when I need to recompute the hash).
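The hash-plus-modtime staleness test can be sketched like this. The struct and function names here are invented for illustration (not Oodle's actual API), and FNV-1a stands in for whatever hash is really used; the point is that the content hash is authoritative and modtime is only a hint for when to recompute it:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// FNV-1a 64-bit, standing in for the real content hash
uint64_t HashBytes(const std::string& data) {
    uint64_t h = 14695981039346656037ull;
    for (unsigned char c : data) { h ^= c; h *= 1099511628211ull; }
    return h;
}

struct SourceFileEntry {
    int64_t  modTime;     // last observed modtime (hint only)
    uint64_t contentHash; // hash of the source bytes (authoritative)
};

// Returns true if content built from 'cached' is stale for the current
// file. The hash is only recomputed when the modtime hint has changed.
bool IsStale(SourceFileEntry& cached, int64_t curModTime,
             const std::string& curBytes) {
    if (curModTime == cached.modTime)
        return false; // modtime unchanged : trust the cached hash
    cached.modTime = curModTime;
    uint64_t h = HashBytes(curBytes);
    if (h == cached.contentHash)
        return false; // touched but identical bytes : not actually stale
    cached.contentHash = h;
    return true;
}
```

The middle case is the whole reason for the scheme: a file that was touched (checkout, copy, re-save) but whose bytes didn't change must not trigger a rebake.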

The goal of the incremental build system is that you can just do whatever you want during dev and everything "just works" - you never see the wrong version of the content or anything weird. I consider it better to be correct and robust than to be hyper efficient. But I also have the simultaneous goal that performance should be as close to the Final mode as possible. There are two big reasons for this. One is just speed - you still want your loads and iteration to be fast during development. The other is even more significant : the way you run during dev should be as close as possible to the way you run when you ship. It also hopefully makes it easier to write code for Oodle, because you just write to one API and it automatically acts the right way.

Okay, that all sounds good, but dear god it gets complex in practice.

The general thing that makes it really tricky is I want to allow the client to do absolutely anything to their content at any time, without telling me, and things should still work. Oodle maintains a lightweight database of what it thinks of the content, but at any time that database can be wrong because the client may have done things without telling me. So I constantly need to be verifying it and reupdating it.

For example, clients might go and change source content while the game's not running. When the game runs next I want it to automatically update to the newer source content. (if they change it while the game is running, I see it and reload immediately). Clients might just delete bundles. They might sync to bundles that were made on a server. They might run the bundler tool and add or remove files from individual bundles. The next time the game runs it should just all work (and it shouldn't have to spin over every file and load everything up and validate it either).

I think I have this largely fixed for most normal cases. Basically it involves a lot of "lazy evaluation" of database correctness; as things are requested or brought in, inconsistencies are detected and fixed. A typical load pattern goes like this :

Client code asks to make resource A resident. (there are various ways that clients can make resources resident, this is the "loosest" path - you could also manually tell a whole Bundle to come resident, or you could ask for a named "paging set" to come resident).

First I check if A is already resident and just give it back. But if there's been an error in the previous load of A, I invalidate the database entry that was used and don't return the bad version.

I see if A has a valid map in the database. If so I try to use one of the existing bundles. Now there's a heuristic for which bundle to load. If any of the bundles that contain A is already in memory or already in the process of loading, I use that. Else, if A is in a header-aggregate bundle, I use that to get the header. (header aggregates contain the headers of lots of resources and generally are made resident very early so you can get the headers before the data comes in). Next, if A is in a compiled bundle, I use that (compiled bundles are presumably made by the packing tool, so are preferred if possible). Finally I fall back to the single bundle.
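That preference order can be sketched as a simple scoring function. The enum and field names below are made up for illustration; the point is just that "already resident or in flight" beats everything, then header aggregates, then compiled bundles, then the single-bundle fallback:

```cpp
#include <cassert>
#include <vector>

// Hypothetical bundle descriptor; names invented for this sketch.
enum BundleKind { kHeaderAggregate, kCompiled, kSingle };

struct Bundle {
    BundleKind kind;
    bool residentOrLoading;
};

// Lower score wins.
int Score(const Bundle& b) {
    if (b.residentOrLoading) return 0; // already in memory / in flight
    switch (b.kind) {
        case kHeaderAggregate: return 1;
        case kCompiled:        return 2;
        default:               return 3; // kSingle : last resort
    }
}

// Pick the preferred bundle among those that contain the resource.
const Bundle* ChooseBundle(const std::vector<Bundle>& candidates) {
    const Bundle* best = nullptr;
    for (const Bundle& b : candidates)
        if (!best || Score(b) < Score(*best)) best = &b;
    return best;
}
```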

If there is no valid database map, I invoke the "baker" for A which knows how to load the data raw and turn it into a single bundle. This is the only pathway that knows how to load raw data, and this is where the client must plug in their file format parsers.

That all catches a lot of cases. For example, if some bundle has been corrupted, I will go ahead and fire the load, when the load fails I will invalidate the map from the resource to that bundle. If you retry the load, it will see there's no mapping to a bundle and fire the baker to make a new one.

When a bundle loads, it contains various objects; it should contain the resource that caused the load request, but it might also contain others. Those others may have also been previously loaded another way. Our new data may or may not be intended to replace the old data. The heuristic that I use is that the raw source file on disk is always authoritative. I believe that is intuitive and what artists and users would expect - if you have a certain BMP on disk, then that is what you will see in the game. If you have some packed bundles that contain older versions of that file, they will be ignored - when they try to register their objects they will be rejected. If you have packed bundles that contain newer versions (eg. from a server or other developer), those will also be ignored if you don't have the newer source files. I'm not sure if this is good or bad, but it's an inevitable consequence of not trusting modtime.

That's all relatively simple (LOL?).

One problem remains - because I'm primarily keying off the source file hash, I can't distinguish between different processings of the same original source file. This happens if you're doing something nontrivial in the baker, which I believe is a common case. The baker might convert textures to DXTC, it might optimize geometry for vertex cache, that kind of stuff. You might want to have a "fast bake" that just gets the data into some useable format, and a "heavy bake" that really does a good job. If there are bundles which contain the content in various levels of bake, I always want to load the heavier bake.

Currently I'm using the heuristic that newer bakes are "heavier", so I always prefer the newest bundle that contains the valid content. But there are various problems with that. One is that of course it's not necessarily correct, since modtime isn't reliable. But even if it *was* always correct, it wouldn't work for me, because I can't tell when two bakings are the same. eg. say I load up bundle X that contains resources {A,B,C}. Now the client asks for resource C. I see that I already have it because I loaded bundle X. But I also see in the game content map that bundle X is not the newest bundle for resource C. There's some newer bundle Y. Can I return the X->C that I already have? Or do I have to fire a load of Y to get the newer version? I can't tell.

I guess the easy solution is to add an extra field to the baking of each resource that indicates the baking level, and require the client to manually set that field. I guess that field should really be a bit-field so that the client can mark up to 32 types of processing that resource might have and then I insist that you get the "most processed" resource at any time.
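The bake-level bit-field idea reduces to a tiny superset test. The flag names below are invented examples of processing passes; the useful property is that "is bake A heavier than bake B" becomes decidable without trusting modtime, and incomparable bakes (disjoint pass sets) are detectable too:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical processing-pass flags; one bit per kind of bake work.
enum BakeFlags : uint32_t {
    kBakeDXTC      = 1u << 0, // textures converted to DXTC
    kBakeVertexOpt = 1u << 1, // geometry optimized for vertex cache
    kBakeMipMaps   = 1u << 2, // mip chain generated
};

// true if bake 'a' contains every pass of 'b' plus at least one more,
// i.e. 'a' is strictly "more processed" and should be preferred.
bool Supersedes(uint32_t a, uint32_t b) {
    return (a & b) == b && a != b;
}
```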

I hate exposing more fields that the client has to set right for things to work. My goal is that all this complexity is completely hidden, the client doesn't really have to worry about it, you just do things and it all magically works.


Semi-related : I wrote a while ago about the speeds that I think you should have with an LZ decompressor for file IO. In that post I assume that the game has some CPU work to do for loading, so the CPU used by the decompressor is stealing from someone else.

It turns out that's changing fast. On a new 8-core chip, it's highly unlikely that the game is actually using all the CPU time *ever* , and even more unlikely that it's using it during load. That means CPU is basically free, which totally changes the equation. Having a slow decompressor like LZMA might still hurt your *latency* but it will improve your throughput.

Of course this is only true on modern CPUs. We sort of have a problem now with CPUs similar to the problem with GPUs - there's a huge difference in capabilities between high spec and low spec, and the way you target them is quite different. The biggest step is probably from 1 core to 2 cores, because the difference between having a true simultaneous background thread and just time-slicing on one core is huge. But the difference from 2 to 8 is also pretty huge. With 8 cores you can do a *lot* very fast in the background. For example you could just do masses of decompression and procedural content generation.

Steam Hardware Survey shows 1 cpu is going away pretty fast, so hopefully soon we will be able to target min specs of >= 2 cpus and >= dx9.

3/02/2009

03-02-09 - Low Level Threading - Part 6

Wrap up and Relacy

I know this stuff is scary, but I believe it's necessary and it's time to embrace it. In the coming years, we will all be writing multi-threaded code. Most of you are probably writing lock free code already. I believe that using solid lock free data structures is the right way to do this. It lets you write them carefully, understand them, profile them, test them, simulate them in Relacy, and then you can be very confident about them, and the client code that uses them is pretty simple and doesn't have to worry too much about weird threading issues.

Now, if you like, you can use message passing between threads with locking. In fact, I encourage you to leave that as an option: in all my code I have #defines to switch the lock-free queue with just a plain old critsec locking queue. Furthermore, I have #defines to remove threading altogether (and just pump the thread routine periodically from the main thread). This lets me debug problems without threading when possible. If the locking queue overhead is fine for you, then go ahead and use it.

People who don't like the lock-free data structures usually wind up using one of two alternatives : 1. "ad-hoc" multithreading, or 2. "parallel for". I believe both of these are wrong in different ways.

"ad-hoc" multi-threading is in widespread use currently in game development. This involves a whole host of dangerous techniques, such as just setting variables in one thread and checking them in another, or randomly tossing in some Interlocked ops to make things "safe". This stuff scares the shit out of me, is not generally well understood or isolated, and can cause really evil bugs. I strongly encourage you all to stop doing this. You are highly likely to be writing rare race conditions when you do this, that will be very hard to find or reproduce. Any time you communicate between threads without using a critsec you are writing lock-free code. Just because you didn't use a lock-free FIFO or something doesn't mean you made it simpler.

"parallel for" is being pushed by a lot of people as a solution to multi-threading. Don't get me wrong, I think "parallel for" is a perfectly good thing - however, it is a very high level abstraction that is not capable of addressing the hard threading problems. It basically only has one use - taking some large task that is very easy to separate into work pieces, doing them on threads, then waiting for them all to be done. That is a common pattern, especially in tools, but it is only one specific work pattern - and it is BY FAR THE EASIEST. I think it's a huge mistake to have these big software products (like .NET Parallel Extensions, OpenMP, Intel TBB, etc.) all basically targeted at "parallel for" type work patterns - because that is a *leaf* usage pattern. Threading libraries should be targeted at the *core* usage pattern, and then you can build your own leaves on it.

Basically "parallel for" is just a syntactic sugar for wrapping up independent operations into "Worklets" that can be run on Workers in a thread pool job scheme. I don't need syntactic sugar. Converting a for loop into Worklets is not the hard part of threading. I'm perfectly happy to write a little code to turn my for loop into a function pointer that can be run from the worker swarm. If somebody gave you a bunch of lock-free queues and lock-free hash maps and cross-platform signals and events, you could code up a job swarm parallel-for thing in an afternoon. I'm serious. It's just not the hard part of the problem at all.
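To back up the "you could code it up in an afternoon" claim, here is a toy parallel-for. This sketch spawns a thread per chunk instead of dispatching Worklets to a real worker pool, which is exactly the simplification that makes it an afternoon's work once the pool and queues exist:

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Toy "parallel for" : chop [begin,end) into contiguous chunks, run
// each chunk on its own thread, then wait for all of them. A real
// version would hand the chunks to a persistent worker-pool instead
// of spawning threads.
void ParallelFor(size_t begin, size_t end, size_t numWorkers,
                 const std::function<void(size_t)>& body) {
    std::vector<std::thread> workers;
    size_t count = end - begin;
    size_t chunk = (count + numWorkers - 1) / numWorkers; // ceil divide
    for (size_t w = 0; w < numWorkers; ++w) {
        size_t lo = begin + w * chunk;
        size_t hi = (lo + chunk < end) ? lo + chunk : end;
        if (lo >= hi) break;
        workers.emplace_back([lo, hi, &body] {
            for (size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (std::thread& t : workers) t.join(); // the "wait for all done"
}
```

Note what this *can't* express - long-lived threads coordinating with each other mid-frame - which is the point of the paragraph above.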

And it's not very useful in realtime apps. We have threads that run together over long periods of time and need to check on each other's status and coordinate latency and timing, etc.

Okay, now let's talk about how you establish confidence in your threading routines. The only way is through exhaustive testing. And I don't mean some little unit test you code up. I mean Relacy Race Detector or an equivalent.

In all my lockfree code, I avoid directly doing things like spinning or even storing variables. I call functions like "LoadAcquire" or "StoreRelaxed" or "SpinBackOff". That way I can just change those for each platform and the code should stay the same. Another advantage is that I can change these functions to use C++0x or Relacy.
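In today's terms that wrapper layer might look like the sketch below, written over C++11 std::atomic (the post's "C++0x"). These particular signatures are my guesses at the shape of such a layer, not the actual cblib functions; the point is that the lock-free code only ever calls these names, so porting - or redirecting to Relacy - means swapping the bodies:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Per-platform primitives behind stable names; lock-free code calls
// only these, never raw loads/stores.
template <typename T> T LoadAcquire(const std::atomic<T>& v) {
    return v.load(std::memory_order_acquire);
}
template <typename T> void StoreRelease(std::atomic<T>& v, T x) {
    v.store(x, std::memory_order_release);
}
template <typename T> T LoadRelaxed(const std::atomic<T>& v) {
    return v.load(std::memory_order_relaxed);
}
template <typename T> void StoreRelaxed(std::atomic<T>& v, T x) {
    v.store(x, std::memory_order_relaxed);
}
inline void SpinBackOff() {
    std::this_thread::yield(); // a real version might pause/spin first
}
```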

Relacy Race Detector is a dynamic thread race checker. It's based on the C++0x model, so adapting the code to work in Relacy basically just means making it use the C++0x model, plus a bit of extra juju.

Relacy replaces the C++0x types like std::atomic with its own versions. The Relacy objects contain a lot of extra information, like what thread last read & wrote the value, and a history of what the value has been over the past. It can tell you things like reading uninitialized variables, data races, etc.

Relacy simulates the possible thread execution orders. It does this by testing thread switches at all the "sync points" in the code. (sync points are places where you touch shared variables). Normal C code that you write is assumed to touch only thread-local stuff - it is run without checking. If you touch either a std::atomic or an rl::var (in my code these are ATOMIC_VAR or NOT_THREAD_SAFE), it creates a sync point. (additional sync points are at any mutex entry/exit, at thread yield, and at memory fences).

The sync points automatically invoke the Relacy scheduler which might do a thread switch. In Relacy you can easily see all the sync points because they have a "$" at them. (the $ is actually just a #define to debug file/line, but it's useful as a quick visual way to see all possible thread switches).

Relacy actually runs your code using fibers. It makes a fiber for each thread and then does the switching between fibers manually. That means that in a Relacy test, your code actually runs single threaded. Plain C code that is supposed to be thread-local is run atomically (with no thread switches) which is nice for debugging.

Relacy has two main simulation modes : "context_bound" and "random". Random is obviously a monte carlo simulation. At each sync point it randomly picks a thread switch. You want to run 10^8 or so iterations of random, and you should make the test pretty complex to try to stress the algorithms. "context_bound" is a full enumeration of all possible program execution orders (bounded by the context limit, which is some kind of tree depth limit). A test with context_bound can find very obscure races that you might never see in a million years of real world testing, such as switching back and forth between two threads on exactly alternating lines of code or whatever. Generally you want to run a very simple test with context_bound such as just doing one push and one pop.

Note that Relacy does not check that your code is actually producing the right output - you need to do that yourself using RL_ASSERT. What Relacy checks is that the code is 1. free of obvious bugs like races, memory leaks, etc., and 2. produces the same output every time regardless of thread switch pattern.

This #2 is obviously crucial - it's the whole goal of writing correct threaded code : the action is the same regardless of the execution thread switch pattern. Obviously to check this, your program must be deterministic. That means, no using rand (use rl::rand instead), no using statics (use test_suite::before() or test_suite::invariant() instead). (note that if you use rl::rand, all possible rand values will be simulated, so generally use a small count).

Relacy also tracks the full execute state and can save and restore the state at any point. Basically it journals the entire thread switch history. I haven't actually used this yet, but it could be awesomely handy for finding really tricky threading bugs.

Personally I believe that lock free code that has not been thoroughly tested with Relacy is toxic and I want nothing to do with it. (as I've written previously, if you actually plug a lot of the respected published lock-free algorithms into Relacy you will find races) (BTW another warning : do not just grab the example code that comes with Relacy and try to use it in your project - some of Dmitriy's examples are intentionally broken so that you can see how Relacy finds errors).

Also, there are some other race checkers out there. CHESS by Microsoft Research is an okay one if you are on Windows. The Intel Thread Checker is similar. I believe they are both inferior to Relacy. There are also some that work by static analysis; the static analysis tools generally work by having you write up your program in a custom syntax. I haven't tried these but I don't like the idea of having to translate between a custom syntax, because it creates the possibility that the code as written in the custom tool is okay, but after translating into real code it's broken. The ideal thread checker IMO works directly on code in its real syntax so there is no translation danger, and is also comprehensive, not sampling.


You may now download code to my lock free data structures in Relacy wrapper : (MSVC >= 7)

myrelacy.zip (111k)

(this is based on Relacy 1_3_0 ; the official version of Relacy 1_3_0 has some bugs that are fixed in here; Dmitriy will be releasing an official fixed version soon)

ADDENDUM ON RELACY LICENSE : (revised 9-14-09)

Relacy is now released under the BSD license :


/*  Relacy Race Detector
 *  Copyright (c) 2009, Dmitry S. Vyukov
 *  All rights reserved.
 *  Redistribution and use in source and binary forms, with or without modification,
 *  are permitted provided that the following conditions are met:
 *    - Redistributions of source code must retain the above copyright notice,
 *      this list of conditions and the following disclaimer.
 *    - Redistributions in binary form must reproduce the above copyright notice, this list of conditions
 *      and the following disclaimer in the documentation and/or other materials provided with the distribution.
 *    - The name of the owner may not be used to endorse or promote products derived from this software
 *      without specific prior written permission.
 *  THIS SOFTWARE IS PROVIDED BY THE OWNER "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
 *  THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 *  IN NO EVENT SHALL THE OWNER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
 *  OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 *  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
 *  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

My work with Relacy is 100% free for any use. However, the original Relacy license still applies to all work product made with Relacy, such as my code above.

The version of Relacy that I built my code with was released under a previous license that was less clear and more restrictive. Dmitry says the new BSD license applies.

03-02-09 - Sleep Sucks and VSync Woes

Ok, first let me describe the issue :

You want to hit 60 fps (or whatever) pretty reliably. Each frame there is some amount of work you *must* do (eg. rendering, respond to input), and there is some amount of work that you would like to do (eg. streaming in data, decompressing, etc.). You want to do as much of the optional work as possible each frame, without using too much time.

You would also like to behave well with other apps. eg. you don't want to just spin and eat 100% of the CPU when you're idle. (this is especially important for casual games where the user is likely browsing the web while playing). To do this you want to Sleep() or something with your idle time.

So, ideally your frame loop looks something like :


-- page flip --

start = timer();

[ do necessary work ]

remaining = start + frame_duration - timer()

[ do optional work as remaining time allows ]

remaining = start + frame_duration - timer()

[ if remaining is large , sleep a while ]

-- page flip --

You want the duration of these three things to add up :

(necessary work) + (optional work) + (sleep time) <= frame_duration

but as close to frame_duration as possible without going over
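The budget arithmetic in the loop above can be pulled out into pure functions so the sign conventions are explicit. All names here are mine (not any real API); times are in milliseconds:

```cpp
#include <cassert>

// Time left in the frame : positive means budget remains, negative
// means the frame already blew its budget.
double RemainingMillis(double frameStart, double frameDuration, double now) {
    return frameStart + frameDuration - now;
}

// Decide what to do with the rest of the frame : do optional work while
// its estimated cost still fits, then sleep only if a worthwhile chunk
// of time is left, else just flip.
enum FrameAction { kDoOptionalWork, kSleep, kFlipNow };

FrameAction NextAction(double remainingMillis,
                       double optionalWorkCostMillis,
                       double minSleepMillis) {
    if (remainingMillis >= optionalWorkCostMillis) return kDoOptionalWork;
    if (remainingMillis >= minSleepMillis)         return kSleep;
    return kFlipNow;
}
```

The catch, as the rest of this entry shows, is that the "sleep" branch is the one Windows won't let you implement reliably.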

Okay. There is one specific case where this is pretty easy : if you are in full screen, using V-sync page flips, *AND* your graphics driver is actually polite and sleeps for V-sync instead of just spinning. I don't know what the state of modern drivers is, but in the past when I looked into this I found that many major companies don't actually sleep-to-vsync on Windows, they busy spin. (this is not entirely their fault, as there is not a very good vsync interrupt system on windows).

Sleep-to-vsync is important not just because you want to give time to other apps, but because the main thread wants to give time to the background threads. eg. if you want to put streaming on a lower priority background thread, then you write your loop just like :


-- page flip --
[ do necessary work ]
-- request page flip --
    puts main thread to sleep
    [ optional work runs until vsync interrupt switches me back ]
-- page flip --

and you assume the page flip will sleep you to vsync which will give the background loader time to run.

Even if your driver is nice and gives you a sleep-to-vsync, you can't use it in windowed mode.

A lot of games have horrible tearing in windowed mode. It seems to me that you could fix this by using the Dx9+ GetRasterStatus calls. You simply don't do a flip while the raster scan line is inside your window (or right near the top of your window). You wait for it to be past the bottom or in the vblank area or far enough above you. That should eliminate tearing and also make you hit the frame time more exactly because you sync to the raster.

(if you just use a timer without using the raster you can get the worst possible tearing, where a flip line scans up and down on your image because you're just barely missing sync each frame).

But we still have the problem that we can't sleep to vsync, we basically have to spin and poll the raster.

Now you might think you could look at the remaining time and try to sleep that amount. Ha! Sleep(millis) is a huge disaster. Here's a sampling of how long you actually sleep when you call Sleep(0) or Sleep(1) :


sleepMillis : 0 , actualMillis : 28.335571
sleepMillis : 0 , actualMillis : 22.029877
sleepMillis : 0 , actualMillis : 3.383636
sleepMillis : 0 , actualMillis : 18.398285
sleepMillis : 0 , actualMillis : 22.335052
sleepMillis : 0 , actualMillis : 6.214142
sleepMillis : 0 , actualMillis : 3.513336
sleepMillis : 0 , actualMillis : 15.213013
sleepMillis : 0 , actualMillis : 16.242981
sleepMillis : 0 , actualMillis : 8.148193
sleepMillis : 1 , actualMillis : 5.401611
sleepMillis : 0 , actualMillis : 15.296936
sleepMillis : 0 , actualMillis : 16.204834
sleepMillis : 1 , actualMillis : 7.736206
sleepMillis : 0 , actualMillis : 3.112793
sleepMillis : 0 , actualMillis : 16.194344
sleepMillis : 0 , actualMillis : 6.464005
sleepMillis : 0 , actualMillis : 3.378868
sleepMillis : 0 , actualMillis : 22.548676
sleepMillis : 0 , actualMillis : 4.644394
sleepMillis : 0 , actualMillis : 28.266907
sleepMillis : 0 , actualMillis : 51.839828
sleepMillis : 0 , actualMillis : 7.650375
sleepMillis : 0 , actualMillis : 19.336700
sleepMillis : 0 , actualMillis : 5.380630
sleepMillis : 0 , actualMillis : 6.172180
sleepMillis : 0 , actualMillis : 30.097961
sleepMillis : 0 , actualMillis : 17.053604
sleepMillis : 0 , actualMillis : 3.190994
sleepMillis : 0 , actualMillis : 27.353287
sleepMillis : 0 , actualMillis : 23.557663
sleepMillis : 0 , actualMillis : 27.498245
sleepMillis : 0 , actualMillis : 26.178360
sleepMillis : 0 , actualMillis : 29.123306
sleepMillis : 0 , actualMillis : 23.521423
sleepMillis : 0 , actualMillis : 6.811142

You basically cannot call Sleep() at all EVER in a Windows game if you want to reliably hit your frame rate.

And yes yes this is with timeBeginPeriod(1) and this is even with setting my priority to THREAD_PRIORITY_TIME_CRITICAL. And no nothing at all is running on my machine (I'm sure the average user machine that has fucking Windows Crippleware grinding away in the background all the time is far worse).

So I have no solution and I am unhappy.

There definitely seems to be some weird stuff going on in the scheduler triggering this. Like if I just sit in my main thread and do render-flip , with no thread switches and no file IO, then my sleep times seem to be pretty reliable. If I kick some streaming work that activates the other threads and does file IO, then my sleep times suddenly go nuts - and they stay nuts for about a second after all the streaming work is done. It's like the scheduler increases my time slice quantum and then lets it come back down later.

My links on this topic :

Using Waitable Timers with an Asynchronous Procedure Call (Windows)
Timing in Win32 - Geiss
Priority Boosts (Windows)
Molly Rocket View topic - Why Win32 is a bad platform for games - part 1
IDirectDrawGetScanLine
How to fight tearing - virtualdub.org
Examining and Adjusting Thread Priority
Detecting Vertical Retrace with crazy vxd
CodeProject Tearing-free drawing with mmtimer
CodeGuru Creating a High-Precision, High-Resolution, and Highly Reliable Timer, Utilising Minimal CPU Resources

I am the linkenator. I am the linkosaurus rex. Linkotron 5000. I have strong Link-Fu. I see your Links are as big as mine. My name is Inlinko Montoya, you linked my father, prepare to be linked!


ADDENDUM : The situation seems to be rather better in Dx9+. You can use VSync in Windowed Mode, and the VSync does in fact seem to sleep your app. (older versions of Dx could only Vsync in full screen, I didn't expect this to actually work, but it does).

However, I am still seeing some really weird stuff with it. For one thing, it seems to frequently miss VSync for no good reason. If you take one of the trivial Dx9 sample apps from the SDK, in their framework you will find that they don't display the frame rate when Vsync is On. Sneaky little bastards, it hides their flaw.

In any sample app using the DXUT framework you can find a line like :


txtHelper.DrawTextLine( DXUTGetFrameStats( DXUTIsVsyncEnabled() ) );

This line shows the frame rate only if Vsync is Off. Go ahead and try turning off VSync - you should see a framerate like 400 - 800 fps. Now change it to :

txtHelper.DrawTextLine( DXUTGetFrameStats( true ) );

so that you show framerate even if Vsync is on. You will see a framerate of 40 !!!

40 fps is what you get when you alternate between a 30 fps and a 60 fps frame. ((1/30) + (1/60))/2 = 1/40 . That means they are missing vsync on alternating frames.
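That arithmetic spelled out, since the harmonic-mean behavior surprises people: the average frame *time* of alternating 60 Hz and 30 Hz frames is (16.67 + 33.33)/2 = 25 ms, which is 40 fps, not the 45 you'd get from naively averaging the rates:

```cpp
#include <cassert>
#include <cmath>

// Average fps when frames strictly alternate between two rates :
// average the frame times, then invert.
double AvgFpsAlternating(double fpsA, double fpsB) {
    double avgFrameTime = (1.0 / fpsA + 1.0 / fpsB) / 2.0;
    return 1.0 / avgFrameTime;
}
```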

If you go and stick a timer around just the Present() call to measure the duration spent in Present, you should see something like :


no vsync :
    duration : lo : 1.20 millis , hi : 1.40 millis

vsync :

    duration : lo : 15.3 millis , hi : 31.03 millis

Yuck. Fortunately, the samples are really easy to fix. Just put this line anywhere in the sample app :

timeBeginPeriod(1);

to change the scheduler granularity. This says to me that Present() is actually just calling the OS Sleep() , and is not using a nice interrupt or anything like that, so if your scheduler granularity is too high you will sleep too long and miss vsync a lot. In fact I would not be surprised if Present just had a loop like :
    for(;;)
    {
        if ( IsRasterReady() ) break;

        Sleep(1);
    }
which sucks donkey balls for various reasons. (* addendum : yes they are doing this, see below). If in fact they are doing what I think they are doing, then you will hit vsync more reliably if you pump up your thread priority before calling Sleep :

    int oldp = GetThreadPriority( GetCurrentThread() );
    SetThreadPriority( GetCurrentThread() , THREAD_PRIORITY_SOMETHING_REALLY_FUCKING_HIGH );

    Present( ... );

    SetThreadPriority( GetCurrentThread() , oldp );
    
this tries to ensure that the scheduler will actually switch back to you for your vsync the way you want. (this has no effect in the samples cuz they have no other threads, but may be necessary if you're actually stressing the system).

Note the samples also freak out if they don't have focus. That's because they actually call Sleep(50) when they don't have focus. That's a cool thing and not a bug ;)

To wrap up, here's the approach that I believe works :


Use dx9+

Use Vsync and 1 back buffer

Bump up thread priority & scheduler interval to make sure Present() doesn't miss

Make main thread higher priority than worker threads so that workers run when main is sleeping
(must also check that workers are not starving)

On machines with speed-step , *and* if the app has focus :
Run a very low priority thread that just spins to keep the cpu clocked up

and cross your fingers and pray.

BTW whenever you show something like framerate you should show four numbers :


Instant current value this frame.
Average over last N frames.
Low over last N frames.
High over last N frames.

with N like 10 or something.
obviously you can do this trivially with a little circular buffer.

You can see a lot of things that way. For example when you're doing something like missing vsync on alternating frames, you will see the average is perfectly stable, a rock solid 40.0 , but the instantaneous value is jumping up and down like mad.

BTW cb::circular_array does this pretty well.
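If you don't have something like that handy, a tiny fixed-size ring is enough. Here's a sketch (a hypothetical FrameStats, not cb::circular_array) :

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical frame stat tracker ; keeps the last N instantaneous values
// in a little circular buffer and reports avg/low/high over them.
template <size_t N = 10>
struct FrameStats
{
    double m_vals[N];
    size_t m_count = 0;   // total pushes, also the ring cursor

    void Push(double v) { m_vals[m_count++ % N] = v; }

    size_t Size() const { return std::min(m_count, N); }

    double Instant() const { return m_vals[(m_count - 1) % N]; }

    double Average() const
    {
        double sum = 0;
        for (size_t i = 0; i < Size(); i++) sum += m_vals[i];
        return sum / Size();
    }

    double Low()  const { return *std::min_element(m_vals, m_vals + Size()); }
    double High() const { return *std::max_element(m_vals, m_vals + Size()); }
};
```

Push the frame time (or fps) once per frame and display all four numbers.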

ADDENDUM : I found a note in the Dx9 docs that is quite illuminating. They mention that if you use PRESENT_INTERVAL_DEFAULT, the action is to sync to vblank. If you use PRESENT_INTERVAL_ONE, the action is the same - but there is a difference, it automatically does a timeBeginPeriod(1) for you to help you hit vsync more precisely. This tells me that they are in fact doing the Sleep()/GetRasterStatus loop that I suspected.

3/01/2009

03-01-09 - Clean Up

It does actually help to clean up your machine from time to time. One issue is that NTFS dir listing on large dirs is really slow, so huge dirs full of temp files makes dir listing chug. Another issue is that I believe Firefox has an N^2 kind of bug with temporary files where it eventually becomes outrageously slow.

Anyhoo, cleaning up is also a very good idea before you do a defrag. It's silly to defrag with a huge mess of temporary files when it would be better just to kill them and make them all from scratch afterward.

This is the kind of thing that you probably only need to do like once a year. BTW not a bad idea to do before backing up your machine since it's kind of silly to back up a bunch of temp files. (this stuff was literally 20% of the bytes on my disk - 10 GB out of 50). It's a little bit of a catch-22 because of course you'd like to back up *before* you run around deleting lots of junk. More on backups later..

Here's my cleanup routine :

On every drive :


call dele *.chk
call dele ffast*
call dele -y recycled

And try to get the temp dirs :


call zdel -y -r  %TEMP%\*
call zdel -y -r  %TMP%\*
call zdel -y -r "%USERPROFILE%\local settings\temp\*"
call zdel -y -r "%USERPROFILE%\local settings\temporary internet files\*"

Then go to your source code and wipe out the crud there :


zdel -r "*.(ncb|sbr|pch|bak|pdb|ilk|idb|opt|plg|bsc|obj|suo|log_prev)"

My god the ncb's and the suo's and all that stuff are a huge waste of space. Note I have "obj" and "pdb" in there - you might not want to delete those. I do. If I can't rebuild the obj I figure I don't care about that code any more.

Oh but there's more. Lots of programs keep their own huge temp dirs. (there was one really bad culprit of this and now I forget what it was; damn). One is "Visual Assist".


zdel -r "C:\Program Files\Visual Assist.NET\vc7\cache\*"
zdel -r "C:\Program Files\Visual Assist.NET\vc7\history\*"

I assume you run a P4 server or something at home, so go ahead and checkpoint your journal too :


p4d -jc -z

And I don't have a good non-GUI way to do this, but we've missed some shite from Firefox, so use the GUI to delete all history and temporary files.

And finally we've still missed a ton of garbage in "Documents and Settings" but it's too much of a pain and too much of a risk to run around deleting more, so that's it. NOW we can defrag.


After all that I backed up my lappy. It took 6 bloody hours. I have a 60 GB disk in lappy and it can do nearly 60 MB/sec. That means it should take 1000 seconds. 17 minutes. To just copy the whole disk. Sigh.

I've tried a few backup programs in the day and they all have this severe order-of-magnitude slowness problem because they use the damn file system which is butt slow rather than just mirroring the whole disk. Yes I know there are programs to just do a full disk mirror but from what I can tell, none of them let you then treat the results as just a bunch of files you can browse into.

I never want a backup system that creates some custom file format on the backup server - I just want files. I never want to have to do a "recovery" - I just want to go browse to the backups and find the old files that I backed up and copy them out manually.

What a backup program should do :

1. Read the whole disk as a raw image so it's super fast.
2. Compress the image with a fast LZP or something in memory as it goes (compression here is not really to save space - it should actually make things faster, because compressing is faster than the disk write speed).
3. Write the compressed image to the server.
4. Provide an explorer plugin so that you can browse into the stored image as if it were just a dir and copy files out.

The whole thing should take maybe 30 minutes max.

Oh, while I'm at it - HYACHACHCA - "robocopy" and "xcopy" both have huge terrible bugs. There are certain file names that they can enumerate in their FindFirst/FindNext which they then fail to find when they actually try to copy the files. I've written before about the disaster of file system names on Windows but this is ridiculous. Also robocopy has this retry-wait thing, which is nice if it's because of a file being in use or something, but when it's a hard error like "file not found" it's dumb to spin in retry-wait for 30 minutes. And there's no key you can press to just make it give up and skip that file, you have to ctrl-C and start over at which time you just give up and use your own copy prompt because nobody else seems capable of writing the simplest piece of software.

2/27/2009

02-27-09 - Why is Intel selling me Software -

Intel used to do a pretty good thing where they would write nice like optimized widgets for their chips and give them out as source code to developers for free. Obviously that makes a lot of sense in their business, it sells more chips.

Now they want to charge me for the optimized math routines !? WTF Intel !? And even if I buy it, I don't get source code, I just get DLLs. Not just DLLs, a HUGE MESS of DLLs - 118 of them taking 283 megabytes !

Granted, the "IPP" does a whole lot of stuff. It's got jpeg, H264, MDCT, MP3, hell it's got CELP, BWT, it's got like every damn thing in the universe, but it's all useless because it's not cross-platform, it's not source code, and it's a huge complex system.

Give me little isolated code snippets that just do the one thing I want to steal. Don't force giant systems on me.

I want a good little nugget of *functionality* , not a framework. I have my own framework.


I've been thinking about this a lot recently because to some extent the way I'm writing Oodle right now it's becoming a framework. I think it's becoming pretty good, it's getting to be pretty easy to make really cool stuff happen, but nobody wants to buy a framework. Nobody wants to have to buy into a "system" that will make their life easier - they want to have their own framework and just call you for the little bit of hard stuff they don't want to bother with. I need to make sure Oodle does that.

The danger for me is that I really like writing framework code. The way I like to code, I build up a huge base of really strong fundamentals into a whole system that does everything I want, and then I can just easily tack bits together to make application functionality. So maybe 90% of the code is "library" or "framework" and the glue and application is minimized.

Like to do my new vert snapper for galaxy it was like - I already have my hash map, I already have my int geometry quantizer, boom tack it together, it's like 20 lines of code. That's awesome, but to share that code you have to buy into my framework - if you try to pull just the vert snapper, you find that it's got roots that go deep into the earth of the framework and cling onto the vector, the int vector, the int-float math, the hash map, the std::vector, my template util set, etc etc you probably wind up with 20 files.

I wish you could sell mesh algorithms. Just looking back at Galaxy to do the BMP triangle removal makes me miss it; the algorithms are really complex and interesting, the results are concrete and compelling, and there's so much more to do. I just don't see how to make it a product. Like for example a really good UV mapper I think would be a great product, it's hard to do, it doesn't really exist, and everyone needs it. But they don't know they need it.

2/26/2009

02-26-09 - Low Level Threading - Table of Contents

Table of Contents since searching Blogger is so awkward :

(the older articles here were written before I totally understood this stuff but I believe they are still correct, if a bit more confused)

Part 1 : Introduction to the issues, volatile, memory barriers, and interlocked ops.

Part 2 : Thread Safe singleton and info about CRITICAL_SECTION.

Part 2.01 : A bunch of links.

Part 2.02 : Clarification of x86 memory model and dependent read race issues in earlier posts.

Part 2.5 : TLS

Annotated Links

Part 3.0 : Intro to lock free programming and memory models.

Part 3.1 : Introducing CAS and cache line issues.

Part 4.1 : LIFO MPMC stack

Part 4.2 : LIFO MPMC stack : ABA

Part 4.3 : LIFO MPMC stack : issues

Part 5.1 : FIFO SPSC queue : simple and fast

Part 6 : Wrap up , Relacy, and the code.

Added 06-12-2012 : See also the newer threading post index .

02-26-09 - Low Level Threading - Part 5.1

SPSC FIFO : the simple message queue

Phew, I've lost the spark that was fueling me to write these, but I want to get one more out before I quit : the SPSC FIFO. This is a simple and useful structure, and I think it's a nice example because it illustrates many different properties from the MPMC stack.

SPSC means single producer single consumer ; one thread makes nodes, another uses them. Immediately we don't need to worry about all the GC issues and node lifetime because we have just a single consumer. We can easily make it two ways just by having an SPSC going in each direction. We can easily make it MPMC by using many of them, eg. for 3 threads to all talk both ways to each other uses 6 SPSC FIFOs. In many cases this is actually much more efficient than using an MPMC structure.

It's a FIFO so there are two ends, and we know only the producer thread is touching the input end, and only the consumer thread is touching the output end. Thus, even though it is a multi-threaded structure, the calls to "enqueue" and "dequeue" are always single-threaded against themselves. This is crucial.

You can also make either end of the SPSC into multi by putting a critsec on that end and changing the convention for who the "owner" is : rather than the "producer" being a certain specific thread, you define it to be the locker of the producer critsec. Now you have an MPSC queue which requires a lock to produce but none to consume. Or you could do SPMC or MPMC this way. This is okay in some cases, not in others. Herb Sutter designed a queue like this in Dr. Dobbs recently (see link in annotated links area). It's certainly better than having a single lock for the whole queue, because if only one thread is producing or consuming they will get the lock with no contention, but obviously it's not lock-free.

You can do a FIFO in an array, but it's actually much harder that way. If you had an infinite amount of memory for your FIFO it would be very simple indeed, but you don't, so you either have to make it circular or dynamically resize the array. Both are possible lock-free (or near-lock-free anyway), but they both break the magic property that the consumer and producer don't fight over the structure. If it's circular you have to worry about the head running into the tail and suddenly you have the producer and consumer contending. So we'll go with nodes and a linked list.

The FIFO is a linked list from the head to the tail :


m_head -> node -> node -> node -> m_tail

We push new nodes on the tail and pop them from the head. We want to do both of these as simply and unconditionally as possible, so we don't allow a totally empty queue. We always have a dummy node in the Queue so that when the queue is empty, head == tail.

In particular it should be clear that the Push operation is very simple and unconditional :


newNode->next = NULL;
m_tail->next = newNode;
m_tail = newNode;

When the Queue starts, m_tail points at the dummy, we just tack something onto his next. Then at all times we know we can just jam into the next pointer at the end to add something to the list. The key is partial ownership of the structure by the different threads.

The consumer owns the "m_head" pointer - but not necessarily the node that m_head points to (if the queue is not empty then it also owns that node). The producer owns the "m_tail" pointer and also the "->next" pointer on the last node (which is also the first node if the queue is empty). The funniest case occurs when the queue has one element in it - the dummy plus one more - in that case the producer owns the "->next" pointer in the last node, but the consumer owns the data in the last node ! More on this later.

Okay let's do the Lock Free Push in detail :


struct SPSC_FIFO_Node
{
    Atomic< SPSC_FIFO_Node * > m_next;

    void * m_data;

    // optional cache line padding here
};

struct SPSC_FIFO
{
    Atomic< SPSC_FIFO_Node * > m_head;

    // cache line padding here

    Atomic< SPSC_FIFO_Node * > m_tail;

    // optional cache line padding here
};

void Push(SPSC_FIFO * fifo, SPSC_FIFO_Node * node)
{
    // at this point I own "node" completely
    // I wrote node->m_data before coming in here
    StoreRelaxed(&(node->m_next), NULL); // relaxed store because I own it

    // get the tail pointer - again I own it so relaxed load is fine
    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));

    // as soon as I set back->next here , node is now visible to the consumer
    // I no longer entirely own node (I do still own node->next)
    // Release here to ensure node->next and node->data are written first
    // (*1)
    StoreRelease(&(back->m_next),node);

    // now advance tail -
    // this is just for me, so relaxed is fine
    StoreRelaxed(&(fifo->m_tail),node);
}

No atomic RMW's at all! No conditional checks - just careful memory ordering. The constraints here are slightly subtle :

(*1) is the point where the consumer might be able to get at node. Obviously the stuff inside node must be out to him before he pops node. You do not need any kind of store fence here when you use the Release memory constraint. We're writing data out in a stream :


node->data = A;     // 1 - Relaxed
node->next = NULL;  // 2 - Relaxed
back->next = node;  // 3 - Release

Release means this can go out  1,2,3 or 2,1,3 , but never 1,3,2 or something invalid.

Similarly when the consumer Pops next, he will use an Acquire :

popped = head->next; // 1 - Acquire
head = popped->next; // 2 - Relaxed
data = popped->data; // 3 - Relaxed

Again this can go 1,2,3 or 1,3,2 but never a bad order that would see whack data.

BTW on most architectures (*1) doesn't actually even need to be Release because the loads we care about are dependent, and dependent data automatically is acquire/release on every major architecture, but pretend you didn't hear that.

Also note why we used the dummy - the tail can never be popped. If we didn't have that you could have sequences like :

    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));

    .. now consumer thread runs and pops everything ...
    .. !! back gets freed !!

    StoreRelease(&(back->m_next),node);

However, the dummy is a bit weird in pop. After we push a bunch of nodes, it's sitting there at the head with nothing in it. If we pop a node, we pop the dummy. We can't throw it away, because we need a dummy to always be in the queue so that we never pop the last node. A common pattern used in other cases is to repush the dummy. (this is used in the MPMC FIFO and the MPSC FIFO - if you pop the dummy, you just push it back on the other end). But we can't do that here because we are the consumer, not the producer, and this is SPSC, so we don't have access to push. Thus we must use the method of sliding the dummy down the list.

It looks like this :


SPSC_FIFO_Node * SPSC_FIFO_Pop(SPSC_FIFO * fifo)
{
    // m_head is totally owned by me so I can load relaxed :
    SPSC_FIFO_Node * front = LoadRelaxed(&(fifo->m_head));
    // front is the current dummy, it is never null
    // front->m_data is garbage

    // (*1) 
    // front->next is the sync point with the producer !
    // if front == tail we are sharing this variable
    SPSC_FIFO_Node * next = LoadAcquire(&(front->m_next));
    if( next == NULL )
        return NULL; // empty

    // next exists ; now the node "next" is mostly owned by me
    //  (the producer owns next->m_next)
    // front will be returned, it is wholly owned by me

    // shift data from next to front :

    void * data = LoadRelaxed(&(next->m_data));

    // update m_head - relaxed, we own it
    StoreRelaxed(&(fifo->m_head),next);
    
    // copy data into front
    StoreRelaxed(&(front->m_data),data);

    // next is now the dummy
    //  next->m_data is garbage (it's a dupe of what we return here)

    return front;
}

Again it's pretty simple, in fact it's harder to understand that it works than to just write it and use it ;)
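For reference, the same Push/Pop can be sketched against the C++0x style std::atomic interface (my mapping : Relaxed = memory_order_relaxed, Acquire = memory_order_acquire, Release = memory_order_release ; the names here are illustrative, not the actual cblib macros) :

```cpp
#include <atomic>

// Illustrative std::atomic sketch of the SPSC FIFO above :
struct Node
{
    std::atomic<Node *> next;
    void *              data;
};

struct SPSC_FIFO
{
    std::atomic<Node *> head;     // consumer end ; points at the dummy
    char                pad[64];  // keep head & tail off the same cache line
    std::atomic<Node *> tail;     // producer end

    explicit SPSC_FIFO(Node * dummy)
    {
        dummy->next.store(nullptr, std::memory_order_relaxed);
        head.store(dummy, std::memory_order_relaxed);
        tail.store(dummy, std::memory_order_relaxed);
    }
};

// producer thread only :
void Push(SPSC_FIFO * fifo, Node * node)
{
    node->next.store(nullptr, std::memory_order_relaxed);     // I own node
    Node * back = fifo->tail.load(std::memory_order_relaxed); // I own tail
    back->next.store(node, std::memory_order_release);        // publish !
    fifo->tail.store(node, std::memory_order_relaxed);        // private to me
}

// consumer thread only ; returns the recycled old dummy, or NULL if empty :
Node * Pop(SPSC_FIFO * fifo)
{
    Node * front = fifo->head.load(std::memory_order_relaxed);  // the dummy
    Node * next  = front->next.load(std::memory_order_acquire); // sync point
    if ( next == nullptr )
        return nullptr;                                         // empty
    front->data = next->data;  // shift data from next into the returned node
    fifo->head.store(next, std::memory_order_relaxed);  // next is now the dummy
    return front;
}
```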

It's interesting to look at a slightly different possible way to write Pop. This alternative way is often suggested around the net, but I believe it's worse. Instead of checking front->next to check for empty , it checks head == tail :


SPSC_FIFO_Node * SPSC_FIFO_Pop_Alt(SPSC_FIFO * fifo)
{
    SPSC_FIFO_Node * front = LoadRelaxed(&(fifo->m_head));

    SPSC_FIFO_Node * tail = LoadAcquire(&(fifo->m_tail));
    if ( front == tail )
        return NULL; // empty

    // this is all the same :
    SPSC_FIFO_Node * next = LoadAcquire(&(front->m_next));

    StoreRelaxed(&(fifo->m_head),next);

    front->m_data = next->m_data;

    return front;
}

If we use this Pop_Alt with our Push there's a race condition. The problem is in these two lines of Push :

    StoreRelease(&(back->m_next),node);

    (*2)

    StoreRelaxed(&(fifo->m_tail),node);

What's the problem?

At (*2) one possibility is that the ->next pointer is set, but m_tail is not updated yet. Thus when Pop_Alt is called, it will see front == m_tail and return empty even though the node is there. That's not the problem. In fact there's nothing wrong with that. We never said that Pop would *immediately* be able to see a new node if the Push wasn't done yet. In fact with Lock Free stuff you can never guarantee the timing of memory visibility.

The problem at (*2) is that the stores don't have to go in order because we used a relaxed store for m_tail ! That was okay before because Pop never touched tail, but now if they run out of order :


    fifo->m_tail = node

    back->m_next = node

tail gets updated first so Pop_Alt thinks the Queue is not empty. It tries to pop, but ->next is not set yet so it gets NULL ! Remember that "Release" does not prevent stores below it from moving up, only stores above it from moving down.

Obviously the fix is just to ensure that the stores go in the right order :


void Push_Alt(SPSC_FIFO * fifo, SPSC_FIFO_Node * node)
{
    StoreRelaxed(&(node->m_next), NULL);

    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));

    StoreRelease(&(back->m_next),node);

    // Release so these don't reorder !
    StoreRelease(&(fifo->m_tail),node);
}

and now Push_Alt and Pop_Alt work okay. But I believe they are not as good as our original version because the consumer and producer are now both touching the "m_tail" variable which means they cache line fight. There's no contention in the sense that they never do a CAS on m_tail or have to loop or block on each other, but just by having Pop look at m_tail it causes the line to have to sync. (note that this kind of sharing is way way way better than a CAS - this is just a normal cache line propagation and it's read-only on the consumer which is way better than if it was being written from both threads).

And now for the payoff.

If we use our macros and such to convert these actions into the right native operations that respect the memory order constraints, "Push" compiles to this on x86 :


mov [eax],0        // eax is node , [eax] is node->next
mov [eax+4],edi    // edi is the data element , [eax+4] is node->data
mov ecx,[esi+40h]  // [esi+40h] is fifo->m_tail
mov [ecx],eax      // [ecx] is tail->next ; this publishes node
mov [esi+40h],eax  // m_tail = node

Five moves for a lock-free FIFO push ! Note there are absolutely zero "lock" prefix ops or memory barriers, and yet the memory is guaranteed to be passed to the consumer safely. Also note that you really did need to be careful with the Release semantics because if these movs are not done in the correctly constrained order, this is not thread safe !!

This is the real payoff for why we learned about the memory models and using load/store with constraints. Carefully designed algorithms can be extremely efficient. Obviously the fact that this is so efficient on x86 relies on the strong x86 memory model which says all stores are Release, but more generally if we write the code with the absolute minimum of constraints, that lets the system translate it into the most efficient method possible for the given platform.

One annoying thing about this FIFO is it can't be intrusive. Push must allocate a node and Pop must free one. So you need a custom node allocator recycling thing. However there is no ABA issue because there is no CAS so you don't have to worry about SMR or anything.

One issue is that standard thread-caching mallocs are very very bad for FIFO queues. The problem is you are by design always allocating the node on one thread and freeing it on a different thread. Standard thread-caching mallocs usually optimize freeing on the *same* thread that did the allocation, and they make cross-thread frees slow. In fact, one option is to use an SPSC queue to pass the nodes back to the producer without any data just so that he can get the memory back ! Another option is to use one of Dmitriy Vjukov's custom allocators that works well with lockfree fifos.

Because this SPSC FIFO is so fast, it's totally reasonable to make N-thread multiway communication channels by multiplexing SPSC FIFOs. In fact that's what the experts in the field do. There is a problem, in that you need O(N^2) channels, which becomes bad for the large N of the future, but for N <= 8 or so it's a totally reasonable way to go.

The art of good lock free system design is in combining various simple structures to create a more elaborate information flow pattern. And don't forget to make good use of plain old single threaded structures too! For example, if you are often passing many data items at once, don't push them all one by one. Instead, link them together with a ->next pointer to make a plain old single-threaded linked list, and push the whole linked list with just one lock-free op.
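For example, on a Treiber-style LIFO like our MPMC stack, pushing a whole privately-built list is just one CAS. A sketch in std::atomic terms (illustrative ; the real stack also carries the sequence count for Pop, but Push alone can't cause ABA) :

```cpp
#include <atomic>

// Batch-push a privately built linked list onto a simple LIFO stack
// with a single CAS :
struct Node { Node * next; int value; };

std::atomic<Node *> g_stack{nullptr};

// first..last is a plain single-threaded linked list I own entirely :
void PushList(Node * first, Node * last)
{
    Node * localHead = g_stack.load(std::memory_order_relaxed);
    do
    {
        last->next = localHead;  // splice the old top under my private list
    } while ( ! g_stack.compare_exchange_weak(localHead, first,
                  std::memory_order_release, std::memory_order_relaxed) );
}
```

Build the list single-threaded, then PushList(first,last) publishes all of it at once.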

Another cute trick I've seen is to transfer SPSC FIFOs using a more expensive structure like our MPMC LIFO :

Say for example you have N producers and M consumers, N and M both large. The producer wants to jam a bunch of nodes to some consumer and have them worked on, but it doesn't care who. Once some consumer is eating his nodes, he may continue to push them intermittently over time. So I can't just push them all in one big packet. What we do is use one MPMC LIFO or FIFO for all the producers and consumers. Our producer pushes a pointer to an SPSC FIFO. The first consumer that has free time pops the big shared MPMC structure and gets the FIFO. Now he owns it and is the single consumer - now the producer and consumer can talk directly to each other over that link efficiently without going through the slow MPMC structure.

I think I will try to write just one more of these with an introduction to Relacy Race Detector, and maybe a brief sketch of the standard thread pool / work stealing thing that people do now, and I will post my code for all this.

2/25/2009

02-25-09 - Low Level Threading - Part 4.3

LIFO MPMC Stack : Performance Issues

Okay, we have our LIFO MPMC stack finally working, let's talk about what it does.

It sucks. If you actually try to use it with a lot of producers and consumers, it will grind to a halt. The problem is very heavy cache line contention.

A stack only has a single entry point - the "top". Anybody who wants to push or pop from it must go through g_stack and they all fight over each other. To do a Push or a Pop each thread has to do a CAS which we've seen before is pretty expensive. But the real problem is the cache lines. Let's look at Push in terms of what it does to cache lines :


LF_Result Push_NoLoop(Node * node)
{
    // request the cache line with g_stack for my processor
    //  if someone else is doing a CAS we stall for them to release exclusive lock on the line !
    //    then we stall to fetch it from them
    Node * localStack = LoadRelaxed(&g_stack);

    node->m_next = localStack;

    // CAS first requests a reservation of the cache line of g_stack and must stall on all other CAS's
    //    that also evicts the cache line of g_stack from all other processors
    return CAS_Release( &g_stack, localStack, node );
    // CAS is done, g_stack cache line gets flushed out
    //    other processors can now fetch it to their cache
}

As I mentioned before this isn't a huge disaster on shared-cache multi-core chips, but it's death on real SMP systems without shared caches.

If you even just have one thread that is pushing lots of items and another thread popping lots of items - they will be constantly trading exclusive access to that cache line, forcing the other one to evict it, back and forth over and over. (some day we will talk about a much better way to do an SPSC FIFO).

Because we know this cache line problem is so bad, we should at least put padding around g_stack. Say we have it at file scope with some other variables :


int g_x;
LFStack g_stack;
int g_y;

We know g_stack is thrashing cache - but now anybody touching g_x and g_y is also feeling the pain and causing us more problems. So we should isolate it :

int g_x;

char _pad1[CACHE_LINE_SIZE];
LFStack g_stack;
char _pad2[CACHE_LINE_SIZE];

int g_y;

This way at least our problems are only our problems. It sucks that we're introducing dependence on a hard-coded cache line size. (on x86 CACHE_LINE_SIZE seems to be 64 now).

Note that the Nodes have a similar though smaller issue. If your Nodes are say 16 bytes or so, and they come from a Node Pool Allocator, that allocator typically will put all the nodes in a row with each other. So now you have two different threads that pop off two nodes from a Stack - those nodes will often be directly adjacent in memory. Boom cache line contention. The big issue is always cache line contention.

One solution is to pad up your nodes to CACHE_LINE_SIZE. If you don't want to suffer that penalty of wasted memory use from padding your nodes, you could use a strided allocator, rather than return node[0], node[1], node[2] , instead you return node[0] , node[4] , node[8] , then node[1], node[5], node[9], etc...
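The strided hand-out order is trivial to compute ; a hypothetical index mapping for a pool of count nodes (count assumed to be a multiple of stride) :

```cpp
#include <cstddef>

// k-th node handed out from a pool of `count` nodes, walking in strides so
// that consecutive allocations are `stride` nodes apart in memory
// (requires count % stride == 0) :
size_t StridedIndex(size_t k, size_t count, size_t stride)
{
    size_t groups = count / stride;           // nodes handed out per pass
    return (k % groups) * stride + (k / groups);
}
```

With count=12, stride=4 this hands out 0,4,8, then 1,5,9, then 2,6,10, etc.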

Also note that all this CACHE_LINE_SIZE aligning of things is actually really bad for cache use because of the way caches are keyed. It's a well known cache optimization to make sure you *don't* make objects exactly CACHE_LINE_SIZE offset from each other. So you might want to align your nodes to (CACHE_LINE_SIZE+8). Sigh.

One obvious way to reduce contention on something like this MPMC stack is striping or binning. Rather than just have one stack you have 4. To push a node you use round robin or rand and push on one of them. To Pop you try them all. This helps in the case that your stacks are usually not empty so Pop usually succeeds on its first try. There are better solutions to this problem though.

One big optimization for the MPMC stack that you can use in some cases is to Pop all. Rather than popping the stack, the consumer just grabs the whole thing using atomic exchange :


Node * Pop_All()
{
    Node_And_Seq nullStack = { 0, 0 };
    Node_And_Seq localStack = Exchange(&g_stack, nullStack);

    // localStack.p is now exclusively owned by me

    return localStack.p;
}

This is an unconditional exchange, not a CAS, we don't need to loop or anything. We make the stack empty and take ownership of the whole contents. We can now work on a bunch of nodes in a batch which gets rid of the pain of the cache contention.

ADDENDUM : I should say that there are use cases the MPMC stack is really good for. One is the case that you are mostly just using it from one thread, that thread is doing 90% of the pushing and popping to the stack, so it gets to own the cache line, but once in a while other people need to push or pop things to the stack. It's fine for that, especially if the "once in a while" tends to happen when I'm not heavily hitting it.

Okay, I think we have now talked about the majority of the issues of lock free coding. In part 5 we will talk about the SPSC FIFO which is a very elegant illustration of the advantages of memory models and partial exclusive ownership.

02-25-09 - Low Level Threading - Part 4.2

LIFO MPMC Stack : ABA

I'm sure many of you know the remaining problem with the Stack in the last post is ABA. Let me briefly sketch how it bites you :


stack starts empty

Thread1 pushes A
    stack = A ; A->next = NULL

Thread2 starts to pop and sees that head is A ,
    it sees A->next is NULL

Thread1 pops A
    stack = NULL;
Thread1 pushes B
    stack = B; B->next = NULL;
Thread1 pushes A
    stack = A -> B -> NULL;

Thread2 finishes popping
    head is still A , so CAS succeeds
    stack = NULL;
    return A;

!!! What happened to B !!!

Okay, this has been discussed a lot around the net. In the case of these kind of Node pointer based data structures, you could say that the problem is because we are recycling nodes. That is, Thread1 popped A and then recycled that memory and pushed A again. If we never recycled nodes we would not have ABA problems here.

In fact, one solution that some of the Lock Free heavyweights use to this problem is to use special variants of GC that prevent ABA. If you are using a GC like SMR (Safe Memory Reclamation), then the GC knows that Thread2 is holding a pointer to A, so when Thread1 tries to push a new node, it won't be given A (even though A has been freed).

Using a GC that prevents ABA is nice because it makes your LF code simpler and removes ABA all over. I'm not going to go into the details of how to implement SMR but there are some links on it in the big Annotated Links post that I made.

However, I want to stress that ABA is not just a problem with memory recycling. In fact, the ABA problem is because CAS doesn't really do what we asked it to do !!

If you recall back when I introduced CAS I said we were using it to prevent us from stomping changes made by other threads while we were preparing our change :


do
{
    Data localData = LoadAcquire( &g_sharedData );
    Data newData = localData;
    .. mutate newData ..

    // only apply newData if sharedData hasn't changed since we started :
} while ( ! CAS_Release( &g_sharedData, localData, newData ) );

But that's not what CAS actually does !! CAS checks that the *value* is the same as when we started, not that it hasn't been touched !

What we really want to be checking is a time stamp or a revision count on the memory, or some kind of GUID for the object. We want to know if the object is the same object that we loaded in the beginning of our loop. Obviously if the g_sharedData thing encapsulates the entire state of the shared object by value, then it is fine to check it with CAS.

The problem with the LIFO stack is that we are using CAS just on the head pointer - which is NOT a valid unique ID for the whole stack. Obviously {A->NULL} and {A->B->NULL} have the same head pointer but are very different stacks. Using CAS just on the head is a bug.

The standard solution to this problem is to use a "uniqueifier" , a sequence id or revision count. This is very common practice any time you're recycling things that you need to be considered different even though their value is the same. We keep our Node the same, but the root of the stack changes :


struct Node_And_Seq
{
    Node *    p;
    int        s;
};

Atomic< Node_And_Seq > g_stack = { 0, 0 };

Note that we are now making "Node_And_Seq" together the object that must be Atomic. If the pointer is 32 bits, this object is 64 bits and we now rely on having an atomic 64 bit CAS. More on this later.

Now we just change our Pop to also increment sequence and work atomically on Node_And_Seq :


LF_Result Pop_NoLoop(Node ** pNode)
{
    Node_And_Seq localStack = LoadAcquire(&g_stack);

    *pNode = localStack.p;

    if ( localStack.p == NULL ) return LF_Success; // empty stack , *pNode is NULL
    
    // get next pointer and bump sequence count :
    Node_And_Seq nextStack = { localStack.p->m_next, localStack.s + 1 };

    // try to step stack to g_stack->next if no one else has changed it
    return CAS_Release(&g_stack,localStack,nextStack);
}

Now the CAS at the end is checking that localStack has the same pointer *and* the same sequence count, because it is atomically comparing 64 bits. Also note that the Load at the beginning atomically loads 64 bits, so it always loads the stack head pointer and sequence count together !

If you had written your Pop like this :


LF_Result Pop_NoLoop(Node ** pNode)
{
    Node * localStack = LoadAcquire(&g_stack.p);
    int localSeq = LoadAcquire(&g_stack.s);

    ....

}

you would have written a race condition !! But of course you wouldn't do that, right? You know that Node_And_Seq must be treated as a single combined object to make it a valid GUID for the whole stack. Well, the classic Fober-Orlarey-Letz paper on LIFO has this very race in it.

(Aside : Lock Free programming is *hard* ; even many of the academic papers detailing lock free algorithms have races!)

Note that I've only talked about changing Pop. We don't need to rev the sequence number in Push, because Push can never cause ABA on its own - you're never pushing a node that's already in the stack unless you have a bug. To cause ABA you must have at least one Pop, and there we rev the sequence id.

Technically under C++0x Push should be changed to also work on Node_And_Seq so that both Push and Pop are doing atomic 64-bit ops on the g_stack shared data variable. On some processors, if you address the same memory with different size loads & stores, it is not guaranteed to be atomic. On x86 it's a very small perf win to keep using just the 32-bit head pointer in Push, and it's safe, but I personally have just switched to doing 64-bit all the time because it's better to avoid problems (and it works in Relacy).

Note that if we're doing Node_And_Seq we could also easily make the sequence count 16 bits and add a 16-bit stack depth counter. That way we could cheaply support depth queries. (Windows SList does something like this). Note that those 16-bit accesses may not be atomic, but you should never try to do them anyway - you always do a full 64-bit atomic copy from g_stack to a local variable, then you poke around inside it.
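A sketch of that layout (field sizes as just described ; layout and names are mine, not SList's actual format) :

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// 32-bit pointer/offset + 16-bit sequence + 16-bit depth = 64 bits,
// in the spirit of Windows SList ; illustrative layout only.
struct Stack_Head
{
    uint32_t p;      // head node (32-bit address or offset)
    uint16_t s;      // sequence count, revved on every Pop
    uint16_t depth;  // current stack depth, for cheap queries
};

std::atomic<Stack_Head> g_stack{ Stack_Head{ 0, 0, 0 } };

// Never touch the 16-bit fields in shared memory directly : copy the
// whole 64 bits atomically to a local first, then read the fields.
int Depth()
{
    Stack_Head local = g_stack.load( std::memory_order_acquire );
    return local.depth;
}
```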

Okay, now this is all good, but what happens when we go to 64-bit pointers? We could make Node_And_Seq 128 bits ; new x64 chips have a 128-bit CAS, so we could use that. The problem is that the first generation of x64 chips did not have CAS-128 (cmpxchg16b). Fortunately, we don't actually need a full 64-bit pointer. There are a few different solutions :

1. We know our memory is always in our application's virtual address range. Our app is never going to have more than maybe 50 bits of virtual address range. So rather than store a pointer in our Stack, we can store an offset from the base of our virtual address range. You could use 48 bits for offset and 16 bits for sequence id. Note that the bottom 3 bits of a node address are always zero because nodes must be 8-byte aligned, so those 48 stored bits actually cover 51 bits of offset.

2. Our Nodes for Stack are coming from a Node Pool Allocator (GC) of some kind anyway. So again we can just have our Node Pool allocator reserve a virtual memory address range and then we know that our Nodes are always in that range. In fact, it could reserve two gigs of virtual address range for Nodes, and then we can just use 32 bit pointers and not worry about any of this !!
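Option 1's packing can be sketched like this (48-bit offset field, 16-bit sequence ; the constants and names are illustrative) :

```cpp
#include <cassert>
#include <cstdint>

// Pack {offset, sequence} into one 64-bit word. Nodes are 8-byte
// aligned, so the bottom 3 bits of the byte offset are dropped : a
// 48-bit field covers 51 bits of offset range.

const int      kOffBits = 48;
const uint64_t kOffMask = ( uint64_t(1) << kOffBits ) - 1;

uint64_t Pack( uint64_t byteOffset, uint64_t seq )
{
    // byteOffset must be 8-aligned and < 2^51 ; seq must fit in 16 bits
    return ( ( byteOffset >> 3 ) & kOffMask ) | ( seq << kOffBits );
}

uint64_t UnpackOffset( uint64_t packed ) { return ( packed & kOffMask ) << 3; }
uint64_t UnpackSeq   ( uint64_t packed ) { return packed >> kOffBits; }
```

The packed word then lives in a plain 64-bit atomic and gets CAS'd exactly like Node_And_Seq above.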

I posted a link earlier about how Windows SList deals with this problem, and it's rather interesting. But it's important to note that your implementation does not have to deal with the same issues as they do - you control where your memory is coming from (it's never in the kernel address range, for example), while they have to be able to link any memory in the system.
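To close, here's option 2 taken to its conclusion : 32-bit node indices plus a 32-bit sequence count packed into one 64-bit word, which is always-lock-free CAS-able even on x64 without cmpxchg16b. The fixed array stands in for a reserved virtual address range, and all the names are mine for illustration, not Oodle's :

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Nodes live in one preallocated pool, so a 32-bit index replaces the
// pointer. Low 32 bits of g_stack = head index, high 32 bits = sequence.

struct Node { uint32_t m_next; };

const uint32_t kNull = 0xFFFFFFFFu;
Node g_pool[1024];   // stand-in for a reserved virtual address range

std::atomic<uint64_t> g_stack{ uint64_t(kNull) };

void Push( uint32_t node )
{
    uint64_t local = g_stack.load( std::memory_order_acquire );
    uint64_t next;
    do
    {
        g_pool[node].m_next = uint32_t(local);           // link to old head
        next = ( local & 0xFFFFFFFF00000000ull ) | node; // seq unchanged
    } while ( ! g_stack.compare_exchange_weak( local, next,
                 std::memory_order_release, std::memory_order_acquire ) );
}

uint32_t Pop() // returns kNull when empty ; loops internally
{
    uint64_t local = g_stack.load( std::memory_order_acquire );
    for(;;)
    {
        uint32_t head = uint32_t(local);
        if ( head == kNull ) return kNull;
        // next head in the low bits, revved sequence in the high bits :
        uint64_t next = uint64_t( g_pool[head].m_next )
                      | ( ( ( local >> 32 ) + 1 ) << 32 );
        if ( g_stack.compare_exchange_weak( local, next,
                 std::memory_order_release, std::memory_order_acquire ) )
            return head;
    }
}
```

Note how Push leaves the sequence alone while Pop revs it, matching the earlier discussion of where ABA can actually occur.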
