2/18/2009

02-18-09 - Perfectionists are Good Bosses

I pulled Jeff in to show off my awesome thread work dispatcher that I was all proud of. I pull up my threadprofile viewer that shows the concurrency, and he immediately goes "what is this time gap at the end where you don't show anything happening?". I was like "hmm god dammit I'm trying to show off my cool shit here don't point out the flaws". But today I realized that on the main thread I was doing a sleep-loop to wait for the work to all get done, when really I should be using an event to wake the main thread when the workers run out of work. With a sleep loop you can waste time up to the length of your sleep interval. With a wait event you wake instantly when it's time to wake.

I already knew about the evils of sleep-loops and the goodness of events for doing thread waiting, I just got lazy in this particular loop and did it the sleep way cuz it's easier. Perfectionist bosses can be annoying when you feel like you did something well enough, but they're also damn helpful, and they push you to be honest about the quality of your work. It's the role I used to play at OW, and I missed having it when I was working on my own. Anyway I fixed it and the missing time gap disappeared and now I am happy. (this could also be a blog about the awesomeness of good tools - I never would have known about the time gap without the threadprofiler and viewer).

BTW about windows events : something that has been hammered home for me recently is that Windows manual reset events are just a race condition waiting to happen. !! Always use auto-reset events !!

For example, if the worker thread is sending you signals once in a while to let you know certain things have happened, if you have code like this in the main thread with a manual reset event :


WaitForSingleObject( hEvent, INFINITE );
  (*) race here
ResetEvent( hEvent );

That's a race. If your thread is swapped out after the Wait but before the Clear, other threads can run along and signal the Event. Then it switches back to you and you clear the event and lose any signals that were set.

I have often used manual reset events in the past successfully, because I use them to signal that a shared data item has changed which is then protected by a critical section. As long as you take the critical section and use the shared data to reevaluate whether to wait, that pattern is safe from races even with a manual reset event (it doesn't matter that you may miss some event signals because you will see it when you check the data). But now that I am doing more lock-free stuff, it's more important to get the event right, and hell there's no reason to not just use auto-reset always, so do!! (when in doubt, be safer).

And now a teaser/brag. This is the performance of the work dispatcher :


linear        : 0.770 seconds, 2312158485 clocks   100.000% of linear time
1 threads     : 0.770 seconds, 2313504188 clocks   100.058%
2 threads     : 0.386 seconds, 1160138775 clocks    50.175% , speedup = 1.993 = 0.99650 per core
3 threads     : 0.258 seconds,  774830107 clocks    33.511% , speedup = 2.984 = 0.99467
4 threads     : 0.194 seconds,  582231143 clocks    25.181% , speedup = 3.971 = 0.99275
5 threads     : 0.194 seconds,  582824062 clocks    25.207%
6 threads     : 0.194 seconds,  583144837 clocks    25.221%
7 threads     : 0.194 seconds,  583397962 clocks    25.231%
8 threads     : 0.194 seconds,  582415282 clocks    25.189%

Making a work dispatcher with almost perfect scaling is not really that impressive (there are plenty of existing ones around the net - and actually if you are Windows-only there are some good ones built directly into Windows), but there are some things that I am happy about :

The loss from going from single-thread linear to dispatched in packets is very close to zero.

The loss from running lots of small packets as opposed to a few big ones is very close to zero. (I tried packets of between 4 and 64 items and it made almost no difference).

There's no slowdown when you have too many threads (obviously this is a 4 core machine), which is a common flaw in broken work dispatchers (they start getting slower and slower with too many threads because they have contention issues and they fight each other).

(I should say this testing is on a quad-core chip, which doesn't show scalability problems nearly as much as a true 4-proc machine with separate caches, so there may still be scalability flaws I don't know about)

2/16/2009

02-16-09 - Low Level Threading Junk Part 2.5

Back to threading. See the old posts to catch up. We talked a bit about memory models and singletons and such in the last section. One thing we haven't really talked about is the value of TLS (Thread Local Storage). In our singleton discussion we mentioned that the C++ "static" is a huge dangerous trap that should be avoided. In fact, all statics and globals are dangerous and must either be eliminated or thread protected.

One way to fix statics/globals nicely is to make them thread-local. In fact, I would argue that statics/globals should be thread local by default, and only with great care should you make them not thread local. Basically access to TLS variables is totally fast and safe and you don't need to worry about threading issues as long as you only work on TLS values. It's a good pattern to have minimal thread communication areas, and then copy what you need into your TLS. (of course there are other ways to do "thread local" data other than TLS, such as just by convention - like you know that a certain global is only touched by one thread, but that requires care and commenting).

The TLS is a block of memory associated with each thread that you can index into to get at variables; it's analogous to the statics section of a normal exe. Now there are functions to manually get access to the TLS but there's basically no reason to use those, and I don't recommend it. Every compiler on every platform now provides something like "__declspec(thread)" which declares a variable to go in TLS. You just define it like :


__thread int thisVarIsInTLS;

If you want to know a lot about the details of how TLS works, I found a great set of articles by Nynaeve all about how TLS is implemented in Windows :

Nynaeve » Blog Archive » Thread Local Storage, part 1 Overview
Nynaeve » Blog Archive » Thread Local Storage, part 2 Explicit TLS
Nynaeve » Blog Archive » Thread Local Storage, part 3 Compiler and linker support for implicit TLS
Nynaeve » Blog Archive » Thread Local Storage, part 4 Accessing __declspec(thread) data
Nynaeve » Blog Archive » Thread Local Storage, part 5 Loader support for __declspec(thread) variables (process initializatio
Nynaeve » Blog Archive » Thread Local Storage, part 6 Design problems with the Windows Server 2003 (and earlier) approach to
Nynaeve » Blog Archive » Thread Local Storage, part 7 Windows Vista support for __declspec(thread) in demand loaded DLLs
Nynaeve » Blog Archive » Thread Local Storage, part 8 Wrap-up

One important thing to note : MSVC supports CINIT for TLS. That means you can use real C++ classes in TLS, and what happens is each time a thread is created, it runs the "cinit" list for those objects. This is *not* widely supported by other compilers, so if you want to write code that is widely portable you should use only basic C types in TLS. It's a bit of a shame because having the cinit and shutdown for TLS would be handy. Instead you have to make your own "shim" or "thunk" routine around your thread to init and shutdown objects.
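
A sketch of what such a shim can look like (all the names here are made up for illustration) :

```cpp
// the "shim" idea : per-thread objects are constructed in the shim at
// thread entry, handed to the real thread body, and destructed when the
// shim returns - doing by hand what MSVC's TLS cinit would do for you.
struct ThreadContext
{
    int scratch[256];
    int itemsProcessed;
    ThreadContext() : itemsProcessed(0) { }   // runs at thread start
    ~ThreadContext() { }                      // runs at thread shutdown
};

int RealThreadBody(ThreadContext & ctx)
{
    ctx.itemsProcessed = 10; // stand-in for the actual work
    return ctx.itemsProcessed;
}

// this is what you'd actually hand to CreateThread / pthread_create :
int ThreadShim()
{
    ThreadContext ctx;            // the "cinit"
    return RealThreadBody(ctx);
}                                 // destructor runs here : the "shutdown"
```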

So far as I can tell, every major compiler on every major platform supports something like "__declspec(thread)" or "__thread". There are some nice advantages to doing this over using manual TLS allocation and indexing. For one thing the compiler can write much better code to access it since the offsets are constants not variables, and it can make use of the direct pointer to the TLS struct. More importantly, you don't need to worry about the initial set up of the TLS indices in a thread-safe single-instantiation way (which we saw in earlier articles is a pain).

Next post will be on some details of lock-free algorithms.

BTW while I'm posting big blog series, I gathered Larry Osterman's series on Concurrency. This series is not great, there are some errors, but it is pretty comprehensive, so :

Concurrency, Part 1
Concurrency, Part 2 - Avoiding the problem
Concurrency, Part 3 - What if you can't avoid the issue
Concurrency, Part 4 - Deadlocks
Concurrency, part 5 - What happens when you can't avoid lock misordering
Concurrency, part 6 - Reference counting is hard
Concurrency, part 7 - Why would you ever want to use concurrency in your application
Concurrency, Part 8 - Concurrency for scalability
Concurrency, Part 9 - APIs that enable scalable programming
Concurrency, Part 10 - How do you know if you've got a scalability issue
Concurrency, Part 11 - Hidden scalability issues
Concurrency, part 12 - Hidden scalability issues, part 2
Concurrency, Part 13 - Concurrency and the CLR
Concurrency, Part 14 - Odds and Ends
Concurrency, Part 15 - Wrapping it all up.
Concurrency, part way too many1 - concurrency and the C runtime library.
Concurrency, part way too many2 - Concurrency and the Win32 API set
Concurrency, part way too many3 - Concurrency and Windows

02-16-09 - Amusing Word Problem

Found this and it confused me for a second :

in a country in which people only want boys, every family continues to have children until they have a boy. if they have a girl, they have another child. if they have a boy, they stop. what is the proportion of boys to girls in the country?

Please settle this argument for me. Will it be 50/50?

(assume the a priori chance of boy or girl is 50/50 , and each family stops at its first boy).

2/10/2009

02-10-09 - Image Compression Blues

Good lord the world of image compression is so screwed up. There's no standard image test set (the old Kodak images are archaic, and even then people use different variants of Lena, this new test images set is pretty good but it's not standard so fuck), there's no standard error measure (people even compute RMSE and PSNR differently), and really we should be using a perceptual measure, but again there's no good standard perceptual measure (it is nice to see that things like SSIM are catching on - however SSIM is a bit vague in its specification of blocking, and there are various implementations that do it differently, so we're back in the fucked up comparing apples-to-oranges methodology).

Making it all worse is that people keep making new shitty standards and claiming they "look good to their eyes". I mentioned before that the HD Photo guys were saying some sort of silly things about error. Well guess what, it sucks. I just found this nice benchmark with graphs that shows HD Photo doing much worse than even old baseline JPEG (!!) under the SSIM metric. WTF, how can you do worse than JPEG !? Bush league IMO.

I also wasted some time this morning looking at PGF (libPGF). PGF is a semi-open wavelet library. It does have some good properties. The code is actually very simple and semi-readable. Compression performance is a bit better than baseline JPEG. On the minus side, compression performance is not anywhere close to state of the art. Even my very simple/fast "cbwave" beats it handily.

BTW looking at the PGF source code this is what it seems to do :

It uses a very simple small integer lifting wavelet transform. It's a 5/3 tap transform; perhaps it's the Le Gall transform which is also used in JPEG2000? It does not do anything smart about memory flow for the subbands, it's basically a bad brute-force implementation, which means "cbwave" can beat it easily for speed.
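
For reference, here's a sketch of one level of the integer 5/3 lifting transform (my own code with simple symmetric extension at the edges - not PGF's actual implementation). It's exactly invertible in integers :

```cpp
#include <vector>

// one level of the integer 5/3 "Le Gall" lifting transform.
// output layout : low-pass (s) in the first half, high-pass (d) in the second.
void Lift53Forward(std::vector<int> & x)
{
    int n = (int)x.size(); // assume n even and >= 2
    std::vector<int> s(n/2), d(n/2);
    for(int i=0;i<n/2;i++)
    {
        int left  = x[2*i];
        int right = (2*i+2 < n) ? x[2*i+2] : x[2*i]; // symmetric extension
        d[i] = x[2*i+1] - ((left + right) >> 1);     // predict : high-pass
    }
    for(int i=0;i<n/2;i++)
    {
        int left = (i > 0) ? d[i-1] : d[i];          // symmetric extension
        s[i] = x[2*i] + ((left + d[i] + 2) >> 2);    // update : low-pass
    }
    for(int i=0;i<n/2;i++) { x[i] = s[i]; x[n/2+i] = d[i]; }
}

void Lift53Inverse(std::vector<int> & x)
{
    int n = (int)x.size();
    std::vector<int> even(n/2), odd(n/2);
    for(int i=0;i<n/2;i++)
    {
        int left = (i > 0) ? x[n/2+i-1] : x[n/2+i];
        even[i] = x[i] - ((left + x[n/2+i] + 2) >> 2); // undo update
    }
    for(int i=0;i<n/2;i++)
    {
        int left  = even[i];
        int right = (i+1 < n/2) ? even[i+1] : even[i];
        odd[i] = x[n/2+i] + ((left + right) >> 1);     // undo predict
    }
    for(int i=0;i<n/2;i++) { x[2*i] = even[i]; x[2*i+1] = odd[i]; }
}
```

Because the inverse undoes the exact same integer expressions in reverse order, the round trip is lossless, which is what makes the lifting formulation so nice.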

The coder is kind of interesting. It's a bitplane based coder with no entropy coder. It just uses a certain kind of bit-packing that makes small streams when there are lots of zeros, it's similar to the old EZW or SPIHT type of zero-tree coding just with bit sequences. Obviously these codecs have implicit modeling and "entropy coding" built in to the bit sequence spec, they just avoid the arithmetic coder. The method in PGF works by breaking the subbands into blocks, choosing a linear walk order on the block, and then doing linear RLE on the bitplanes. The significance bitstreams are basically a few scattered ones with big blocks of zeros, and the RLE is just coding that out.

Another thing I found that I wasn't aware of is the new T.851 variant of JPEG1 ; basically it's a new arithmetic coder called Q15 stuck on the back end of JPEG instead of Huffman or the old QM coder. The IJG is pushing for this, I don't really know what the status is. The performance should be fine. At low bit rates a deblocking filter helps a lot and could make this a decent choice.

02-10-09 - Fixed Block Size Embedded DCT Coder

A while ago when I wrote about DXTC I also included numbers for something new that I called "CodeTree" or "CodeLinear". I figured I should take a second to write down what they do before I forget.

CodeTree and CodeLinear are two variants of a Fixed Block Size Embedded DCT Coder. In the post above I wrote :

Fixed bitrate blocks inherently gives up even more. It kills your ability to do any rate-distortion type of optimization. You can't allocate bits where they're needed. You might have images with big flat sections where you are actually wasting bits (you don't need all 64 bits for a 4x4 block), and then you have other areas that desperately need a few more bits, but you can't give them to them.

So, what if we keep ourselves constrained to the idea of a fixed size block and try to use a better coder? What is the limit on how well you can do with those constraints? I thought I'd see if I could answer that reasonably quickly.

What I made is an 8x8 pixel fixed rate coder. It has zero side information, eg. no per-image tables. (it does have about 16 constants that are used for all images). Each block is coded to a fixed bit rate. Here I'm coding to 4 bits per pixel (the same as DXT1) so that I can compare RMSE directly, which is a 32 byte block for 8x8 pixels. It also works pretty well at 24 byte blocks (which is 1 bit per byte), or 64 for high quality, etc.

This 8x8 coder does a lossless YCoCg transform and a lossy DCT. Unlike JPEG, there is no quantization, no subsampling of chroma, no huffman table, etc. Coding is via an embedded bitplane coder with zerotree-style context prediction. I haven't spent much time on this, so the coding schemes are very rough. CodeTree and CodeLinear are two different coding techniques, and neither one is ideal.

Now lets go into more details about how this is done. All the basics are quite simple, but we have to be careful about the details. To achieve the fixed block size, I write an "embedded" style stream, and then just truncate it down to the fixed rate for each block. In order for this to be okay, we have to be very careful that we are writing the most important bits first. For very simple blocks (such as single flat colors) we will be able to write the whole block exactly within the rate limit, and no truncation will occur. More typically the lossless size of the block will be very large (maybe 100 bytes or so) and we're chopping it down to 24, 32, or 64 bytes. (obviously any fixed rate size is possible, but those are the main ones that we consider).

We're going to be sending bit planes. We need to order the bit planes from most important to least important, and also order the values within each plane. We have 8x8 pixels, 3 colors for each pixel, and 12 bits per sample. That means we are transmitting 8*8*3*12 = 2304 bits in each block, and we want each bit to be in order of importance. It's crucial to be precise about what we mean by "importance". What we mean is that the signal reconstruction from each bit in order has an expected decreasing contribution to the L2 norm. (L2 norm because we are measuring quality with RMSE - if you have a different quality metric you should order the bits for that metric). (Note that higher bits are always more important than lower bits, so we really only need to worry about ordering the 8*8*3 = 192 samples and we know to descend the bitplanes in order - though there is the possibility of sending the top bitplanes of some coefficients before the top bitplanes of other coefficients).

We're applying two transforms - the DCT and the color conversion. We need both of them to be "Unitary" transforms so that we know how our transformed coefficients relate to RGB pixel values. That is, every value in post-transform space should affect the L2 norm in the same way. Many many people mess this up.

First we have to be careful with the color conversion. I chose YCoCg because it's easy and exact and it works fine. But the standard YCoCg lossless code that people use scales up the Co and Cg by *2 compared to Y. That shifts the way the bit planes relate to RGB errors. We could remember this and step through the planes differently, but it's easier just to shift Y left by one bit. That means the bottom bit plane of Y is always zero. In our coder we could just not send these bits when we get down to that bit plane, but I don't bother, because if we get down to the very bottom bit plane it means we're sending the data lossless. Specifically I do this :


void RGB_to_YCoCg(int r,int g,int b, int & y,int & Co,int & Cg)
{
   Co = r - b;
   int t = b + (Co/2);
   Cg = g - t;
   y = t + (Cg/2) - 128;
// scale up Y so it's at the same magnitude as Co and Cg - the bottom empty bit never gets sent
   y <<= 1; 
}
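
For reference, here's the matching inverse (my own sketch; the forward transform is repeated so the snippet stands alone). Because the same truncating divides are undone in reverse order, the round trip is exactly lossless :

```cpp
// forward, repeated from above so this compiles standalone
void RGB_to_YCoCg(int r,int g,int b, int & y,int & Co,int & Cg)
{
    Co = r - b;
    int t = b + (Co/2);
    Cg = g - t;
    y = t + (Cg/2) - 128;
    y <<= 1; // bottom bit of Y is always zero
}

// the inverse : undo each step in reverse order
void YCoCg_to_RGB(int y,int Co,int Cg, int & r,int & g,int & b)
{
    y >>= 1;                  // undo the scale-up (the bottom bit was zero)
    int t = y + 128 - (Cg/2);
    g = Cg + t;
    b = t - (Co/2);
    r = b + Co;
}
```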

The next thing is to be careful with the DCT normalization. A lot of the DCT code out there does funny scalings on the various values. You can test this by taking an 8x8 block, setting to all zero, put a 1 in one of the bins, and then run your IDCT. The output will have little values all over, but the squared sum (L2 norm) should be a constant value no matter where you put that 1. If your DCT is not Unitary, you can use this to apply scalings to fix it.
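
The check is easy to write down in 1D; here's a sketch assuming an orthonormal DCT-III (IDCT) basis - the code and names are mine, not from any particular DCT library :

```cpp
#include <cmath>

// unitarity check : reconstruct from a single unit coefficient in bin k
// and measure the L2 norm of the output. For an orthonormal (unitary)
// transform this is exactly 1 no matter which bin you pick.
double IDCT8_L2NormOfBasis(int k)
{
    const double kPi = 3.14159265358979323846;
    // orthonormal scale factors : sqrt(1/8) for DC, sqrt(2/8) for AC
    double ck = (k == 0) ? std::sqrt(1.0/8.0) : std::sqrt(2.0/8.0);
    double sumsq = 0;
    for(int n=0;n<8;n++)
    {
        double v = ck * std::cos( kPi * (2*n+1) * k / 16.0 );
        sumsq += v*v;
    }
    return std::sqrt(sumsq);
}
```

If your IDCT gives back different norms for different bins, it's not unitary, and you can use the measured norms as the per-bin corrective scalings.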

So now we have a DCT + color conversion that's correct about value scaling. The DCT I used increases the dynamic range up to 12 bits (from 8) but the large values are very rare; the typical maximum value is around 512, but 4095 does occur (for example if your block is all 255, the DC value will be very large).

Now, for sample ordering by importance, the right thing obviously is the KLT. We want to send the samples of each block in an order where the first has the most energy, the next has the most of the remaining energy, etc.. This is exactly the KLT (aka the PCA). Now, we can't actually use the KLT because we get to send zero side information. (that is, a decoder must be able to grab one block at random and decode it, there's zero header). Fortunately, for most images, the DCT is very close to the KLT. That is, the first coefficient has the largest expected value, the second coefficient has the next largest, in order. Note that this is without any perceptual scaling or quantization - it's simply a property of the transform that it is rotating energy towards the lower coefficients for most cases. If the input were random, the output would also be random with every sample having equal expected value.

Now we have to code the data; I tried two ways, they performed similarly, I'm not sure which is better. Both coders use arithmetic coding. That's largely just because it's easy for me, it's not inherent to this method necessarily. Arithmetic coding just makes it easy for me to make sure I'm not writing redundancy in the stream. Both methods use non-adaptive arithmetic coders. That means probabilities are stored as constants, not transmitted. I write the whole block lossless, then just truncate it. To decode, I take the truncated block and implicitly stuff zeros after the code stream, so if the arithmetic decoder reads past the end, it gets zero bits. The code stream must be designed correctly so that zero bits in the code stream always select the most likely value on decode.

CodeTree

This is a tree-based bitplane coder similar to EZW or the "EZDCT" that I wrote long ago. It uses zigzag scan and a "parent" relationship. Bits are sent as 2x2 "blocks". A block has a single parent bit. In the 8x8 DCT there are 16 blocks, each block has a parent in the previous "subband" formed by pretending the DCT is a wavelet transform (see the old EZDCT papers for more details; there are many pages on the standard wavelet tree structure).

Bits are sent in a fixed order : first loop on bit planes, then loop on blocks, then loop on color planes. That is :


top bit plane :
   block 0 Y
   block 0 Co
   block 0 Cg
   block 1 Y
   block 1 Co
   ...
next bit plane :
   block 0 Y
   block 0 Co
   ...

The blocks are traversed in zigzag order, starting with block 0 = the DC (the upper-left most 2x2 LL in a wavelet interpretation).

Each block that we send consists of 4 bits in the current bitplane level. We pack those 4 bits together to make a value in [0,15]. When we send that block, we have some context information. We have already sent all of those samples' previous bitplanes, so we can use the values of the bits in those samples' previous bitplanes (in fact, we just use a boolean for each sample - were any of the previous bits on or not). We have also sent the parent sample's bits up to and including the current bitplane. Again I just use a single boolean - was it on. This is just a kind of "significance" mask that is familiar if you know about wavelet coding at all.

The bits within a block are obviously correlated with each other and this is modeled *implicitly* by coding them as a combined 4-bit symbol. Using the combined symbol also means I get 4 bits in one arithmetic decode. When a sample is previously significant, then the remaining bits are close to random (though not quite, zero is slightly more likely). When a sample is not previously significant, but the parent is, it's quite likely that the sample will become significant. If the sample is not previously significant, and neither is the parent, then it's very likely the bit will still be zero. This is what the arithmetic coding gets us. In particular the first few bit planes that we send are almost always all zero and we need to send that very compactly. So previous zeros strongly predict more zeros. These probabilities were measured on a large test set and stored as constants.

When a sample first becomes significant (its first nonzero bit is sent) its sign is then immediately encoded as a raw bit. Note that there are also some special case contexts. The 0th Y sample has no parent - its parent is treated as always on. The 0th Co and Cg use the 0th Y as their parent. All other samples have a parent available and code normally.

The decoder works in the obvious way, but there is one detail : missing bit reconstruction.

When the decoder runs out of bits to read (it hits the end of the fixed size block), it has probably only read some of the samples and some of the bit planes. Because of the way we ordered things, usually what has been dropped is primarily the high frequency information in the Co and Cg. Note that we didn't do any CoCg subsampling, but implicitly that tends to happen automatically, because that's what gets truncated out. Often we have lost quite a few of the bottom bits of the samples. If we just stop decoding, those bottom bits are currently zero, like :

3 bottom bits not received
sample value = 40
sample bits : 0 0 0 0 1 0 1 [ 0 0 0 ]
[] bits not received

Now, just leaving them zero is not too bad, and it's what many people do, but we can do better. This mainly becomes important at very low bit rates (4 bits per pixel or less) because we are discarding a lot of bottom bits. What we want to do is reconstruct those bottom bits to their expected value in the range.

That is, we've gotten the top bits, so we know the true value is in the range [40,47]. What we should do is restore the value to the average (weighted by probability) in that range. The values are not equally likely, so it's not just the middle. Generally DCT coefficients are pretty close to laplacian, centered at zero. The exact distribution depends on the image, or even the block. One thing we could do to be very precise is measure the energy of what we did successfully receive and try to guess the laplacian distribution for this block. I just measured the average in practice, and found on most images it is around 40% of the gap. That is, in this case it would be :

3 bottom bits not received
add 8 * 0.4 = 3.2 -> 3
sample value = 43
sample bits : 0 0 0 0 1 0 1 0 1 1

BTW if the previous bits received are all zero, you must not do this, restore the unreceived bits to zero. Obviously more generally what you should be doing is modeling the unreceived bits based on the received bits and the average energy of the block.

Now, I have a suspicion that you could do something neater here for visual quality. Rather than always restore to 40% of the gap, randomly restore to somewhere in 0% - 50% of the gap (or so). Note that if you restore to zero, the error looks like smoothing. That is, when you restore the unreceived bits to zero, you are letting the shape of the transform basis functions show through. Restoring the unknown to random will reproduce detail - it just might not be the right detail. At low bit rate, you can either get an image which becomes very smooth looking, or an image that becomes grainy in a weird way. In order for this to be ideal you need some per-block sensitivity - if you can tell that the block is pretty noisy, then the grainy random restoration is probably better; if the block looks pretty smooth, then restoring to zero is probably better.
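
A sketch of the basic reconstruction (my own function name; the value here is the magnitude, with the sign handled separately as described above) :

```cpp
// restore unreceived bottom bits to ~40% of the gap. "bias" is the
// measured average position in the gap (0.4 on my test set, per the text).
// If nothing was received (value is zero) we must restore to zero,
// not into the gap.
int ReconstructMissingBits(int receivedMagnitude, int numMissingBits, double bias = 0.4)
{
    if ( receivedMagnitude == 0 )
        return 0;
    int gap = 1 << numMissingBits; // true value is in [m, m+gap-1]
    return receivedMagnitude + (int)( gap * bias );
}
```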

CodeLinear

CodeTree is pretty straightforward (it's very similar to standard EZW type coders) but it does require a lot of constants to be tweaked. There are 16 symbols (4 bit block) and 8 contexts, so there are 16*8 = 128 int constants used by the arithmetic coder. This is undesirable for various reasons, one of which is the risk of overtraining to the training set.

So I wanted a coding scheme that used fewer tweak constants and came up with CodeLinear. CodeLinear throws away the tree structures and codes bits in linear order, as you might guess from the names. The bits are ordered by expected importance, and coding is done carefully to try to send each bitplane in order of most valuable to least valuable bits.

Again I always send the bitplanes in order from top bit to bottom bit. Within each bitplane, the samples and colors are no longer scanned in a simple order.

Instead we take all 8*8*3 (=192) samples, and reorder them based on expected importance. For simplicity, I use a static reorder that was computed on the training set. Note that unlike the arithmetic coding coefficients, this reorder is not susceptible to overtraining because it's almost always similar on all images and even if you get it slightly wrong it doesn't hurt too badly.

Because of this ordering we expect the samples' bits to turn on roughly in order. That is, we expect sample at index 0 to turn on first, then sample 1, 2, 3, etc. That is, we expect them to look like :


1 0 0 0 0 0 0 0
- 1 1 0 1 0 0 0
- - - 1 - 1 1 0
- - - - - - - 1
- - - - - - - -

(the vertical axis is the bitplane #, the horizontal axis is the sample #)
dashes are bits in samples that were previously significant in higher bit-planes
the 0s and 1s are part of the "significance front"

Or something like that - the index of the highest bit is generally decreasing.

We then send the bits in significance passes. We keep track of the highest-indexed sample that was significant in a previous bitplane, we'll call it "max_significant". (obviously this starts at zero). That is, for all samples indexed > max_significant , all of their bits sent so far are zero. For samples <= max_significant some may be significant and some may not be (of course the one *at* max_significant is always significant).

To send a bit plane we send three passes over the bits (note that using max_significant we don't actually have to walk over all the bits, in fact we generally rarely touch a bit more than once).

Pass 1 : previously significant. For all indices <= max_significant , if the value had previous bits on, then send the current bit. The current bit is likely to have a lot of information. It's arithmetic coded with a fixed probability.

Pass 2 : in significant range. For all indices <= max_significant , if the value did not have previous bits on (e.g. not coded in pass 1), then send the bit with another arithmetic probability. If the bit was on, that value is now significant, then send its sign bit raw.

Pass 3 : indices > max_significant. These values all had no previous bits on. Send whether the current bit is now on with an arithmetic coded bit, if it comes on send the sign and also move max_significant up to this index as it is now on. We usually are sending lots of zero bits here, and the few on bits that we send are close to the start. For efficiency reasons, we thus code this differently. At each step first send a flag whether all remaining bits are zero or not. If they are not, we know at least one on bit follows. We then send how many indices away that one bit is. We send this distance with unary. So we are doing two binary arithmetic codes, but we never actually send a "bit on/off" code. The decoder pseudocode looks like :


while( arithDecodeBit() ) // any remaining on
{
   max_significant ++;
   while( arithDecodeBit() ) // decode unary
      max_significant++;   
   // bit at max_significant now turns on :
   sample[ max_significant ] |= current_bit;
   // decode sign too
}

Note that you have to arrange your code so that a zero bit terminates the loops; thus when the stream is truncated and stuffed with zeros you don't get infinite loops.

That's it, pretty simple. We do reconstruction of missing bits like CodeTree does. You can see it uses 4 binary arithmetic codes, which means 4 constants which must be tweaked. Note that binary arithmetic coding with constant probabilities can be made very very fast. You could in fact use table-based coders that do no multiplies or divides (see the Howard-Vitter papers for example which are not patented). Ideally you would tweak a few different sets of 4 constants for different image types and be able to select a constant set for each image. Depending on your situation this may or may not be possible.

Note that there's an obvious possibility for improvement in CodeLinear if you did some R/D optimization in the encoder. In particular, if a value with a high index comes on prematurely all by its lonesome, it can cost a lot of bits to send and not contribute much to distortion. This value should really just get quashed. For example if the bits look something like :


1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 
- 1 1 0 0 0 0 0 0 0 0 0 0 0 0 - 0 
- - - 1 1 1 1 0 0 1 1 0 0 0 0 - 0 
- - - - - - - 1 1 - - 1 1 1 1 - 0
- - - - - - - - - - - - - - - - 1

that one lone bit way out there really boofoos things, it makes the whole area preceding it come on too soon. Note that there are also obviously more sophisticated models of "significance" that you could use other than the simple "max_significant" index thing. I haven't explored that.

02-10-09 - How to fight patents

Some rich folks like Gates and Buffet could create a consortium that just patents everything it can. It would also accept "donations" of patents from people like us who want to make sure their work is free ("defensive patents") - eg if you invent something, you submit the paper to them and they patent it.

Obviously this institution could just make all its patents free for anyone to use. But it could also use them as a hammer to force open other patents.

For example, you could make all your patents absolutely free for any other institution who does not own any patents. If you do own any patents, then you cannot use our patents unless you also make your patents completely free under the same license. (sort of like GPL).

This might be too restrictive, but on the plus side if you just got some good patents into this institute, that would force lots of other people to open up theirs under the same license, and in the ideal world it would quickly snowball.

(of course this was the idea of GPL open source code as well, and I would say that has generally been a complete failure; the real result of the GPL is that nobody serious can use GPL code, so all the GPL libraries out there are just wasted).

2/09/2009

02-09-09 - void pointer

Urg WTF, why doesn't C++ let void * be implicitly cast to any other pointer type !? It's fucking void * ; obviously I'm using it because it's untyped or type-unknown, so I'm going to be casting it. Forcing me to cast in situations like that where it's obvious just makes me more inclined to cast all the time without thinking about it, which is much worse.

Being too strict about enforcing stupid rules is very negative. It makes people ignore the rules. You need to complain only when it really matters.
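To make the gripe concrete, here's a minimal sketch (the helper names are made up) of the cast C++ forces on you that C does not :

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// In C, "int * p = malloc(...)" compiles because void* converts
// implicitly to any object pointer type; C++ demands an explicit cast.
inline int * alloc_ints(size_t count)
{
    // int * p = malloc(count * sizeof(int)); // C : fine. C++ : compile error.
    int * p = static_cast<int *>( malloc(count * sizeof(int)) );
    return p;
}

inline int roundtrip_through_void(int value)
{
    int storage = 0;
    void * erased = &storage;                  // T* -> void* : implicit in both languages
    int * typed = static_cast<int *>(erased);  // void* -> T* : cast required in C++ only
    *typed = value;
    return storage;
}
```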

2/04/2009

02-04-09 - Exceptions

I'm not really high on the whole Herb Sutter super-exception-safe-everything thing, but exceptions really are nice for error handling sometimes. They have two big wins in my opinion (vs. just having error return values and checking with if statements) :

1. Consolidating error handling code and putting the error-handling code in the place that actually can respond to the error in a reasonable way.

2. Correct default behavior if you are lazy or screw up and don't handle the error - it automatically propagates out rather than just being silently ignored. Standard if-checking badly violates one of my guiding principles in coding : if you don't write the right code, it just silently breaks; I always want my programs to noisily scream when I don't write the right code.

Anyway, I'm thinking about this because handling Async IO errors is a huge pain. The big standard problem comes from file opens. The thing is, my "fopen" doesn't actually open the file, it just queues an open request that will get done at some time in the future. If the open fails (say because the file doesn't exist), you don't actually see that until you try to get a char off the file and it's no good.

With if-checking it looks like :


File = fopen();
// no need to check error here really because this did nothing

... do other work to give it some time

char c = fgetc(File);
// ! boom this can fail because the file didn't actually open

// so fgetc is no good, you have to do :
char c;
EStat status = fgetc(File,&c);
if ( status == error )
   ... clean up file, back out work already done, christ

With exceptions it's a million times cleaner. You just say the File ops can throw various IOExceptions. One of those IOExceptions is "FileNotFound". That exception might be thrown in the fopen() or in the next fgetc() or whatever.

In fact with exceptions and class back-out you can also have it automatically undo the temp work you did :

try {

File = fopen();

SomeClass X;
X.DoStuff();

char c = fgetc(File); // this might throw if the open actually failed

X.Read(File);

... okay, all good. now :

World.Apply(X);

} catch ( IOException & e ) {
   // X goes out of scope and destructs, backing out its temp work
}

Sadly I can't use exceptions for various reasons (aside from all the practical problems with exceptions, lots of clients have them disabled, so there you go), but my god the error stuff for this async IO is so nasty. One way I'm handling it is by sort of simulating the exception code style without exceptions.

What I do is still use an fgetc() that just returns a char. If the file fails to open, fgetc() just returns zeros, and an error flag is set in the file. That way you can go ahead and write straight line code without checking errors, and then once in a while you can check the error flag. So the code looks like :

{
File = fopen();

SomeClass X;
X.DoStuff();

char c = fgetc(File); // this just sets File.error

X.Read(File);

if ( File.ok )
   World.Apply(X);
// else X will just be destructed and not be loaded into the world

}

Sadly it's not actually so neat in practice. In particular with many layers of wrapping, exceptions just do What You Want magically. With an error flag I have to be pushing it through manually all the time. A lot of my files are actually a raw disk file, wrapped in a double-buffer file layer, wrapped in an LZ decompress layer. Each layer has to push out the status to the parent, and errors can occur at any point in the code because of the async nature.
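A minimal sketch of the sticky-error-flag style, with made-up class names (the real layers - raw disk file, double-buffer, LZ decompress - are far more involved) : a failed get just returns zero and latches an error bit, and each wrapper layer has to pull that bit outward :

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// The "leaf" file : never fails loudly, just returns zeros and
// latches a sticky error flag that stays set forever.
struct SimpleFile
{
    std::vector<unsigned char> data;
    size_t pos = 0;
    bool error = false;

    explicit SimpleFile(std::vector<unsigned char> d) : data(std::move(d)) {}

    unsigned char get()
    {
        if ( pos >= data.size() ) { error = true; return 0; } // sticky
        return data[pos++];
    }
    bool ok() const { return ! error; }
};

// A wrapper layer (think double-buffer or LZ decompress) : it must
// manually propagate the child's error state to its own flag.
struct WrapperLayer
{
    SimpleFile & child;
    bool error = false;

    explicit WrapperLayer(SimpleFile & f) : child(f) {}

    unsigned char get()
    {
        unsigned char c = child.get();
        if ( ! child.ok() ) error = true; // push the sticky flag outward
        return c;
    }
    bool ok() const { return ! error; }
};
```

This is exactly the manual pushing-through that exceptions would have done for free.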

2/02/2009

02-01-09 - Swap and Templates Part 2

So I'm back into the nightmare of std::swap and template overloading. I've been working on my cb::hash_table a bit, which I built on cb::vector so I'm seeing a lot of stuff with vector. One issue is when the vector changes capacity, you have to copy all the data over and delete the old ones. The normal way to implement vector is to do this with a bunch of uninitialized copy constructor calls, then call the destructor on all the old data. But that copy can be very expensive, and worse if your objects are large in memory it means you can temporarily double your memory use.

So I had the idea to use swap. When the vector capacity changes, first go ahead and default construct everything in the new space so it's valid. Then ::swap the new to the old, and then destruct the old vector. This is of course actually not ideal; what you really want is a single call that would "swap_construct" an object so that you don't have to default construct anything. But anyhoo, for many types of data it's easy to make a very cheap default constructor and a very fast swap.
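A toy illustration of the idea, sketched on top of std::vector rather than a real cb::vector reallocation path; the point is just that growth is cheap default constructs plus swaps, never deep copies :

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Grow-by-swap sketch : default construct the new slots (assumed cheap),
// then std::swap each old element into place. A real vector would use
// raw storage and a "swap_construct"; this just shows no deep copies happen.
template <typename T>
void grow_via_swap(std::vector<T> & v, size_t new_capacity)
{
    std::vector<T> bigger;
    bigger.reserve(new_capacity);
    bigger.resize(v.size());           // cheap default constructions
    for ( size_t i = 0; i < v.size(); ++i )
        std::swap(bigger[i], v[i]);    // O(1) per element for string/vector-like types
    v.swap(bigger);                    // the old (now hollow) storage destructs here
}
```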

Apparently this idea is old, and in fact newer versions of the Dinkum STL do this; they call it Swaptimization, which you can find on that page if you search for swap. BTW it's also obvious that sometimes swapping is worse, eg. for vector< int > - but it's never much worse. If you wanted maximum speed on silly cases like that, you could use a type-traits template to choose whether the class prefers swapping or copying. Actually in old versions of my own vector I had a lot of type-traits to select the most efficient operation, but I've ripped that all out now; I find it too fragile and just also not important. (note that most STL implementations now use type traits for acceleration; I think this is really loathsome, because it's not exposed in a standard way, so you can't get to it for your own types - it makes them fast on basic types and their own types, and slow on your types, which blows - the whole idea of generic programming is that you can adapt it to work on your types, but the standards people have recently started giving special treatment to things in the STL, for example the STL gets different name lookup rules now (wtf)). (for type traits in the STL see for example stuff like "_IsOKToMemCpy" and "has_trivial_destructor" in STLport.)

Anyhoo, the problem now is that you need to make sure the right swap() is called. Ruh roh. You can't really correctly overload std::swap in the std namespace ; see for example : STL defects #225-227 , Alan Griffiths' PDF post about swap problems , or C++ Standards - The "swap" Problem .

So, what might you do instead? One option is to define your own swap() in your own namespace, and rely on the Koenig lookup to find your swap (cuz your objects are in your namespace). That doesn't work, partly because the STL calls std::swap explicitly. I'm not sure why they do that instead of just calling swap(). Apparently new MSVC uses ADL_Swap or some funny mumbo jumbo to try to fix this.

In any case, I have a hacky solution that I'm using for now.

I go into the STL headers to the definition of std::swap. It should look something like this :

   template < class _Tp >
   inline void swap(_Tp& __a, _Tp& __b) {
     _Tp __tmp = __a;
     __a = __b;
     __b = __tmp;
   }   

I change that to :

   template < class _Tp >
   inline void swap(_Tp& __a, _Tp& __b) {
     MySwap(__a,__b);
   }   

Now in my own code I have MySwap :

template < typename Type >
struct swap_functor
{
   void operator () (Type &a,Type &b)
   {
      Type c = a; a = b; b = c;
   }
};

template< typename Type > inline void MySwap(Type &a,Type &b)
{
   swap_functor< Type >()(a,b);
}

Which redirects MySwap to a functor, and the default implementation is the usual swap.

The reason to go to a functor is that you can take advantage of partial specialization due to the stronger support for class templates as opposed to function templates. Thus you can easily partial-specialize to all types of vectors (for example) :

template < class t_entry > 
struct swap_functor< cb::vector< t_entry > >
{
   void operator () ( cb::vector< t_entry > & _Left, cb::vector< t_entry > & _Right)
   {
      _Left.swap(_Right);
   }
};

The reason we do the redirects like this is that when you call anything in std::algorithms, they will call std::swap. First of all they will call the STL-provided overrides for std::swap for various STL types (like std::vector and std::string). Then they will call to the default, which calls MySwap. MySwap will go to the specializations that I wanted for my types. Finally it will fall back to the default swap.

This is the only way I know of to get specialization for both my types and the STL types. Note that to get the specializations on the std types, you have to call std::swap() - it's the "outermost" swap, and you always need to call to the leaves.

Now, for example, if I wanted to also get somebody else's library integrated, call it "somelib" that implements "somelib::swap" , I would do all the above, but the cb::swap_functor default implementation would be changed to call somelib::swap. Again everyone should call std::swap to get all the specializations.

Obviously this is not ideal because it involves digging into other people's libraries.

BTW using Swap in vectors is obviously semantically much better, even if it isn't an optimization. If you consider default constructed objects to be "nulls" and objects that have been set up and pushed into a vector to be "initialized", then the standard copy way of growing a vector temporarily makes a bunch of new "initialized" objects. The swap way just moves them and doesn't temporary change the number. A common case is stuffing something like a shared_ptr ref-counted pointer into a vector - the swap way keeps the ref counts all invariant, while the copy way temporarily bumps the refs up then back down again.
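You can see that ref-count invariance directly; this little check (the function name is mine) shows that swapping vectors of shared_ptr never touches the counts :

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Swapping two vectors of shared_ptr just exchanges buffer pointers;
// the ref counts stay invariant the whole time. A copy-based grow
// would have bumped them up and then back down.
inline long use_count_after_swap()
{
    std::shared_ptr<int> sp = std::make_shared<int>(7);
    std::vector< std::shared_ptr<int> > a(10, sp); // count = 11 : sp + 10 copies
    std::vector< std::shared_ptr<int> > b;
    a.swap(b);                                     // no count changes at all
    return sp.use_count();                         // still 11
}
```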

BTW one thing I noticed while testing this is that std::sort doesn't seem to use swap (!?). I always thought it did. Other algorithms, like std::reverse and std::random_shuffle, do call std::swap. sort seems to invoke the copy constructor and assignment a lot. The RAD version rr::sort does work entirely through swap, which means it should beat std::sort trivially in bad cases, like vector< vector< int > >.

ADDENDUM : thanks to the commenters; I think what I like best is to make my swap_functor default to doing a *byte* swap. Then any object which wants to be swapped and can't stand to be byte swapped (very rare) would override swap_functor. Doing it that way means almost nobody ever has to override swap_functor, it just uses the default (byte swap) almost always.

BTW the cool way to byte swap is something like this :


#pragma pack(push)
#pragma pack(1)
template < int n_count >
  struct Bytes { char bytes[n_count]; };
#pragma pack(pop)

template < typename t_type >
void ByteCopy(t_type * pTo,const t_type & from)
{
   typedef Bytes< sizeof(t_type) > t_bytes;

   *(reinterpret_cast< t_bytes * >(pTo)) = reinterpret_cast< const t_bytes & >(from);
}

template < typename t_type >
void ByteSwap(t_type & a,t_type & b)
{
   typedef Bytes< sizeof(t_type) > t_bytes;

   t_bytes c; // don't use t_type cuz we don't want a constructor or destructor
   ByteCopy(&c,reinterpret_cast< const t_bytes & >(a));
   ByteCopy(&a,b);
   ByteCopy(&b,reinterpret_cast< const t_type & >(c));
}

The nice thing about this is for small types the compiler does the right thing and just generates mov instructions. For large types you would want to call a memswap function, but you would never do this on large types so punt.
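Here's the same trick condensed into one self-contained function (ByteSwapDemo is my name for it) with a quick sanity check. The struct assignment is what lets the compiler emit plain movs for small types - and yes, this deliberately ignores strict aliasing, just like the version above :

```cpp
#include <cassert>

#pragma pack(push)
#pragma pack(1)
template < int n_count >
struct RawBytes { char bytes[n_count]; };
#pragma pack(pop)

// Swap any type as raw bytes : the temp is a char-array struct, so no
// constructor or destructor runs, and the assignments are plain moves.
template < typename t_type >
void ByteSwapDemo(t_type & a, t_type & b)
{
    typedef RawBytes< sizeof(t_type) > t_bytes;
    t_bytes c; // raw storage, no ctor/dtor
    c = reinterpret_cast< t_bytes & >(a);
    reinterpret_cast< t_bytes & >(a) = reinterpret_cast< t_bytes & >(b);
    reinterpret_cast< t_bytes & >(b) = c;
}
```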

1/30/2009

01-30-09 - SetFileValidData and async writing

Extending and async writing files on Windows is a bad situation. See for example Asynchronous Disk I/O Appears as Synchronous on Windows . Basically you should just not even try, but if you really want to give it a go :

1. Open your file for OVERLAPPED and also for NO_BUFFERING. Buffering helps reads, but it severely hurts writes. It takes my write speeds from 80 MB/sec down to like 30 MB/sec. I suspect that the reason is that buffering causes the pages to be read in before they are written out. (it's sort of like using cached memory - it will fetch in the pages even though you're doing nothing but stomping all over them).

2. Use the undocumented NtSetInformationFile to resize the file to its full size before you write anything. SetEndOfFile will only work for page size granularity, NtSetInformationFile can do arbitrary sizes. BTW this is also better for fragmentation than just writing lots of data onto the end of the file, which can cause NTFS to give you lots of allocation chains.

3. Use SetFileValidData to tell Windows that whatever is in the sectors on disk is okay. If you don't use SetFileValidData, Windows will first zero out each sector before you get to touch it. This is like a security thing, but obviously it's pretty bad for perf to basically write the whole file twice. SetFileValidData will fail unless you first ask for your process to get the right permissions, which of course will only work for processes running as administrator. Okay, I did all that. This post is okay, but dear god don't read the thread.

If you do all those things right - your WriteFile() calls will actually be asynchronous. The actual speed win is not huge. Part of the problem is the next issue :

When I do all that, I start hitting some kind of weird OS buffer filling issue. I haven't tried to track down exactly what's happening because I don't really care that much, but what I see is that the writes are totally async and very fast (> 100 MB/sec) for some random amount of time (usually up to about 10 MB of writing or so) and then suddenly randomly start having huge delays. The write speed then goes down to 40 MB/sec or so.

ADDENDUM : when I say "you should not even try" I mean you should just live with WriteFile() being synchronous. It's plenty fast. Just run it from a thread and it still looks async to your thread (you need a thread anyway because OpenFile and CloseFile are very slow and synchronous; in fact the only thing you can rely on actually being fast and async is ReadFile). Also just live with the fact that Windows is zeroing the disk before you write it, everyone else does.
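The "just run it from a thread" suggestion can be sketched like this (a hypothetical WriteQueue class, using standard C++11 threads rather than Win32) : the caller sees an async Push, the worker does the synchronous writes - and note it waits on a condition variable, not a sleep loop :

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// A worker thread drains a queue of write requests, so the caller sees
// an async interface even though each request runs synchronously.
// Toy sketch : a real IO thread would batch requests and report errors.
class WriteQueue
{
public:
    WriteQueue() : m_quit(false), m_worker(&WriteQueue::Run, this) {}
    ~WriteQueue() // drains remaining jobs, then joins
    {
        { std::lock_guard<std::mutex> lk(m_mutex); m_quit = true; }
        m_cond.notify_one();
        m_worker.join();
    }
    void Push(std::function<void()> job) // returns immediately
    {
        { std::lock_guard<std::mutex> lk(m_mutex); m_jobs.push(std::move(job)); }
        m_cond.notify_one();
    }
private:
    void Run()
    {
        std::unique_lock<std::mutex> lk(m_mutex);
        for (;;)
        {
            // event-style wait : wake instantly when there's work or we quit
            m_cond.wait(lk, [&]{ return m_quit || !m_jobs.empty(); });
            if ( m_jobs.empty() && m_quit ) return;
            std::function<void()> job = std::move(m_jobs.front());
            m_jobs.pop();
            lk.unlock();
            job();      // the synchronous "WriteFile" happens here
            lk.lock();
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue< std::function<void()> > m_jobs;
    bool m_quit;
    std::thread m_worker;
};
```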

01-30-09 - Stack Tracing on Windows

There are basically 3 ways to capture stack traces on Windows.

1. Manually walking ebp/esp ; this steps back through the frame pointers; it relies on the caller's frame pointer being pushed on the stack. This is basically what RtlCaptureStackBackTrace or DmCaptureStackBackTrace does, but you can also just write it yourself very easily. The advantage is that it's reasonably fast. The disadvantage is it doesn't work on all CPU architectures, and it doesn't work with the frame pointer omission optimization.

For info on RtlCaptureStackBackTrace, see Undocumented NT Internals or MSDN

2. StackWalk64. This is the new API you're supposed to use. The advantage is it works on all CPUs and it even works with frame pointer omission (!). But you can see from that latter fact that it must be very slow. In order to work with FPO it loads the PDB and uses the instruction pointer map to figure out how to trace back. It also can trace through lots of system calls that normal ebp-walking fails on.

See gamedev.net or ExtendedTrace for examples. But it's really too slow.

3. Manual push/pop in prolog/epilog. Uses the C compiler to stick a custom enter/leave on every function that does a push & pop to your own stack tracker. Google Perftools has an option to work this way. The "MemTracer" project works this way (more on MemTracer some day). The nice thing about this is it works on any architecture as long as the prolog/epilog is supported. The disadvantage is it adds a big overhead even on functions that you never trace. That rather sucks. Stacktraces are very rare in my world, so I want to pay the cost of them only when I actually do them, I don't want to be pushing & popping stack info all the time.
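For what it's worth, on Linux/glibc the analogue of option 1 / RtlCaptureStackBackTrace is backtrace() from execinfo.h (it likewise gets confused by frame pointer omission unless unwind tables are available) :

```cpp
#include <execinfo.h>

// glibc's backtrace() fills the buffer with return addresses for the
// current call stack and returns the number of frames captured.
inline int CaptureTrace(void ** buffer, int max_depth)
{
    return backtrace(buffer, max_depth);
}
```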

1/28/2009

01-28-09 - Graph Memory

So we have a thing to track memory allocs with a stack trace, la di da, no big whoop. I log it all out to a file. So I wrote a thing to parse them into a hierarchy and spit them out with tabs for tabview . That's awesome.

Then I thought, hey, Atman makes these awesome graphs with "graphviz" so maybe I'll try that. One disadvantage of the pure hierarchy view in tabview is that you can't really see the flow when lines merge back up. That is, call graphs are not *trees* they're *DAGs*. Sometimes the stack hierarchy forks apart but then comes back together again. Graphviz should be able to show this neatly. Graphviz makes "SVG" files that you can just load with Firefox (Firefox 3's support is much better than 2's).

Anyway I made some graphs with various options and it's kinda cool. Here's an example : Allocs1.svg (you'll need to use ctrl-wheel to zoom out to see anything). Apparently if you click this link Firefox does nothing good. You have to download it - then open it with Firefox. WTF. Apparently it's my Verio servers doing the wrong thing . Yegads I hate the web.

Not bad, but I'm a little disappointed with graphviz's layout abilities. In particular the actual cell and edge layout seems very good, but they seem to have literally zero code to try to put the labels in good places.

In a bit of odd deja vu, this was one of the very first things I ever worked on as a professional programmer in 1991; I worked for a GIS company that had an old COBOL GIS database engine that worked with Census data and such; I wrote them a new C front end with graphics for PC's and such, and one of the things you have to do is take this raw street data with lat/long coordinates and street names and do some nice munging to make it look okay; a big part of that is a lot of heuristics about putting the labels for streets in good places (and when to repeat the label, etc.).


ADDENDUM : talking to Sean I realized you really want the graph to have different sizing/color options, be hierarchical, interactive, and stable.

That sounds hard, but I don't think it actually is. The key thing that makes it easy is that there is very good hierarchy in this information, and you can create the graph incrementally. I think that means you can just use simple iterative penalty methods to make the graph stable.

Here's my proposal :

Start with the graph at very coarse granularity ; maybe directory granularity if you have a few directories and that makes sense, else file granularity. Whatever coarse level gets you < 32 nodes or so. Just use a solver like graphviz to make this initial graph.

Now, interactively, the user can click any group to expand its hierarchy. When that happens the big cell splits into various pieces. You just create the new pieces inside the parent and make the new edges - and then you just let them time-evolve with a semi-physical iterative evolution.

You apply a penalty force for intersection with neighbors to drive the nodes apart so there's no overlap. You similarly apply penalty forces so that edges never intersect nodes or other edges. And the edges also act kind of like springs, applying forces to try to be short and straight. Stable 2d physics is a pretty solved problem, so you just let them run until they settle down. Note that as they spread apart they can force the other nodes in the graph to move around, but it's all nice and smooth and stable.

I think it's much easier to treat the canvas as just infinitely large and let your nodes move apart all they need to. Graphviz does everything oriented towards being printed on a page which is not necessary for the interactive view.
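A toy version of that penalty/spring iteration (all constants and the force model are made up; a real layout would also repel edges from edges and damp velocities) :

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct LayoutNode { double x, y; };

// One relaxation step : nodes closer than min_dist push apart (overlap
// penalty), edges act like springs pulling toward rest_len. Iterate
// until movement dies down.
inline void RelaxStep(std::vector<LayoutNode> & nodes,
                      const std::vector< std::pair<int,int> > & edges,
                      double min_dist, double rest_len, double k)
{
    std::vector<LayoutNode> force(nodes.size(), LayoutNode{0,0});

    // repulsion : penalty when two nodes are closer than min_dist
    for ( size_t i = 0; i < nodes.size(); ++i )
    for ( size_t j = i+1; j < nodes.size(); ++j )
    {
        double dx = nodes[j].x - nodes[i].x, dy = nodes[j].y - nodes[i].y;
        double d = std::sqrt(dx*dx + dy*dy) + 1e-9;
        if ( d < min_dist )
        {
            double push = k * (min_dist - d) / d;
            force[i].x -= dx*push; force[i].y -= dy*push;
            force[j].x += dx*push; force[j].y += dy*push;
        }
    }
    // springs : each edge tries to be rest_len long
    for ( const auto & e : edges )
    {
        double dx = nodes[e.second].x - nodes[e.first].x;
        double dy = nodes[e.second].y - nodes[e.first].y;
        double d = std::sqrt(dx*dx + dy*dy) + 1e-9;
        double pull = k * (d - rest_len) / d;
        force[e.first].x  += dx*pull; force[e.first].y  += dy*pull;
        force[e.second].x -= dx*pull; force[e.second].y -= dy*pull;
    }
    for ( size_t i = 0; i < nodes.size(); ++i )
    {
        nodes[i].x += force[i].x;
        nodes[i].y += force[i].y;
    }
}
```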


ADDENDUM 2 : after figuring out some evil things I've got graph_viz making better graphs. Here's an example : allocs2 SVG (direct clicking should work now too).

See comments on this post for details.

I'm still not happy with the edge labeling, and the FF SVG navigation is not ideal, but it's useable now.

1/27/2009

01-27-09 - Oddworld Memory Memories

I've been working on some memory allocator related junk recently (just for laughs I made my SmallAllocator in cblib work lock free and timed it running on a bunch of threads; an alloc takes around 200 clocks; I think there may be some cache contention issues). Anyway it reminded me of some funny stuff :

Munch's Oddysee was a wild painful birth; we were shipping for Xbox launch and crunching like mad at the end. We had lots of problems to deal with, like trying to rip as much of the fucking awful NetImmerse Engine out as possible to get our frame rate up from 1 fps (literally 6 months before ship we had areas of the game running at 1 fps) (scene graphs FTL). And then there were the DMusic bugs. Generally I found that when dealing with Xbox tech support, the graphics and general functionality guys were really awesome, really helpful, etc. Eventually we trapped the DMusic exception in the kernel debugger and it got fixed.

Anyhoo... in the rush to ship and compromise, one of the things we left out was the ability to clean up our memory use. We just leaked and fragmented and couldn't release stuff right, so we could load up a level, but we couldn't do a level transition. Horrible problem you have to fix, right? Nah, we just rebooted the Xbox. When you do a level transition in Munch, you might notice the loading screen pops up and is totally still for a few seconds, and then it starts moving, and the screen flashes briefly during that time. That's your Xbox rebooting, giving us a fresh memory slate to play with. (shortly after launch the Xbox guys forbade developers from using this trick, but quite a few games in the early days shipped with this embarrassing delay in their level load).

For Stranger's Wrath we used my "Fixed Restoring" Small Allocator, which is a standard kind of page-based allocator. We were somewhat crazy and used a hefty amount of allocations; our levels were totally data driven and objects could be different types and sizes depending on designer prefs, so we didn't want to lock into a memory layout. The different levels in the game generally all get near 64 MB, but they use that memory very differently. We wanted to be able to use things like linked lists of little nodes and STL hashes and not worry about our allocator. So the Small Allocator provides a small block allocator with (near) zero size overhead (in the limit of a large # of allocations). That is, there are pages of some fixed size, say 16 KB, and each page is assigned to one size of allocation. Allocations of that size come out of that page just by incrementing a pointer, so they're very fast and there's zero header size (there is size overhead due to not using a whole page all the time).

That was all good, except that near the end when all the content started coming together I realized that some of our levels didn't quite fit no matter how hard we squeezed - after hard crunching they were around 65 MB and we needed to get down to 63 or so, and also the Fixed Restoring Allocator was wasting some space due to the page granularity. I was assigning pages to each size of allocation, rounding up to the next alignment of 8 or 16 or 32. Pages would be given out as needed for each size bucket. So if a size bucket was never used, pages for that size would never be allocated.

This kind of scheme is pretty standard, but it can have a lot of waste. Say you allocate a lot of 17 byte items - that gets rounded up to 24 or 32 and you're wasting 7 bytes per item. Another case is that you allocate a 201 byte item - but only once in the entire game ! You don't need to give it a whole page, just let it allocate from the 256-byte item page.

Now in a normal scenario you would just try to use a better general purpose allocator, but being a game dev near shipping you can do funny things. I ran our levels and looked at the # of allocations of each size and generated a big table. Then I just hard-coded the Fixed Restoring allocator to make pages for exactly the sizes of allocations that we do a lot of. So "Stranger's Wrath" shipped with allocations of these sizes :


static const int c_smallBlockSizes[] = { 8,16,24,32,48,64,80,96,116,128,136,156,192,224,256,272 };

(The 116, 136, 156, and 272 are our primary GameObject and renderable types). (yes, we could've also just switched those objects to a custom pool allocator for the object, but that would've been a much bigger code change, which is not something I would want to do very close to shipping when we're trying to be in code lockdown and get to zero bugs).
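A stripped-down sketch of the page-per-size idea (my own toy class, not the shipped Fixed Restoring allocator - no freeing, no restoring, illustrative page size) : an alloc is just a pointer bump inside the page for its size, so there's zero per-allocation header :

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pages are dedicated to one allocation size and handed out by bumping
// an offset. Pages for a size are only created when that size is used.
class SmallBlockPages
{
public:
    explicit SmallBlockPages(size_t page_size = 16*1024) : m_pageSize(page_size) {}

    void * Alloc(size_t block_size)
    {
        Page * p = FindPageWithRoom(block_size);
        if ( ! p )
        {
            m_pages.push_back(Page());
            p = &m_pages.back();
            p->blockSize = block_size;
            p->used = 0;
            p->mem.resize(m_pageSize);
        }
        void * ret = &p->mem[p->used];
        p->used += block_size;          // the whole allocator is this increment
        return ret;
    }

    size_t PageCount() const { return m_pages.size(); }

private:
    struct Page { size_t blockSize; size_t used; std::vector<uint8_t> mem; };

    Page * FindPageWithRoom(size_t block_size)
    {
        for ( auto & p : m_pages )
            if ( p.blockSize == block_size && p.used + block_size <= m_pageSize )
                return &p;
        return nullptr; // no page for this size with room left
    }

    size_t m_pageSize;
    std::vector<Page> m_pages; // page buffers live on the heap, so returned pointers stay valid
};
```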

01-25-09 - Low Level Threading Junk Part 2.02

Some more tidbits before I come back to this :

There is a misconception being widely spread that x86 can reorder reads (Herb Sutter and lots of good people have been repeating this). So far as I can tell that's just not true. The IA-32 spec says that writes don't move past writes nor do reads move past reads. (there are lots of other constraints in there).

x86 plain old load and store are acquire and release. Note that that does NOT mean they have a total order (though if your code is written right for acquire/release semantics you don't need a total order). However volatile (locked) ops on x86 do have a total order.

BTW I am very much not suggesting that you or I write code that relies on the quirks on the x86. It's better to be very careful and generally correct and mark all the places that you are assuming different types of sequencing. If those sequencing commands turn into NOPs on x86, then bully for you. Still, it helps me to actually know what's going on in our systems, and if we're going to say things, let's try to say them right; my god there's so much completely wrong information about this stuff out there. This Gamasutra article is just chock-full of wrong.

I found this guy Anthony Williams who seems to know what's up : intel-memory-ordering-and-c++-memory-model and acquire/release vs. sequential consistency

Also, Bartosz in the famous post where he gets it wrong talks about the different memory model constraints. One is :

# memory_order_consume: potentially weaker form of memory_order_acquire that enforces ordering of the current load before other operations that are data-dependent on it (for instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won't be moved before it (yes, even that is not guaranteed on all platforms!)).

This is for the dependent read case. This is the same place that Linux uses "smp_read_barrier_depends". The typical scenario is like :


   Object * ptr = s_shared_ptr;
   smp_read_barrier_depends();
   int local = ptr->data;

or in C++0x style :

   Object * ptr = s_shared_ptr.load(memory_order_consume);
   int local = ptr->data;

Note that Bartosz says this in a funny way. The issue is not that the compiler or the CPU reorder buffer can move the dependent read before the pointer read. Obviously that's impossible. The issue is that for purposes of *memory timing* it could look as if the dependent read was moved earlier. Say I write this bad code :


   Object * ptr = s_shared_ptr;
   int local = ptr->data;

which is 

   mov ecx   , &s_shared_ptr
   mov eax , [ecx] // eax = ptr
   mov edx , [eax + 4] // data is at ptr+4

Can never be

   mov ecx   , &s_shared_ptr
   mov edx , [eax + 4] // data is at ptr+4
   mov eax , [ecx] // eax = ptr

Obviously there's no way to execute this out of order, the chip and the compiler are not *broken*. But it can look like it went out of order if your cache architecture allows it.

Say some other thread wrote s_shared_ptr->data , then s_shared_ptr. It used a Release so they went to the bus in order. But your chip is crazy dumb and loads cache lines in random order. Your chip reads the line with s_shared_ptr in it, and then your code runs :


   Object * ptr = s_shared_ptr;
   int local = ptr->data;

What you see is the *new* value of s_shared_ptr , but the *old* value of ptr->data. Now your dumb cache pulls in ptr->data but it's too late. We see that our code *acted* like it read ptr->data before ptr.

Fortunately this doesn't happen on modern chips (the Alpha is the only chip I know of that can do this). However to be really correct about marking up your code's memory model semantically you should include the memory_order_consume or smp_read_barrier_depends. Then your compiler can turn those into NOPs ;)

Now it's a little bit of a mystery to me exactly how processors manage this. I think it must be that the caches talk to each other and they must invalidate pages in temporal order or something.


BTW I really don't like the idea of a C++ atomic<> class, or the Magic Microsoft Volatile. The problem with both of those is they hide where the code is doing really crucial synchronization things. You can have code that looks like :


   newnode->next = node->next;
   node->next = newnode;
   return node->data;

and it's secretly using atomic<> or volatile and it only works because it's relying on those to do the right acquire/release stuff. Dear god.

The really scary thing about both of those is that they are deceptively magical. They work very neatly. Most of the time. And they make it very easy for any programmer who doesn't know anything to go ahead and write some lock free code. Yikes.

I would much rather see everyone continue to use locks, and when you really do need to do lock-free stuff, the C++0x proposal for specifically marking up the semantics needed with memory model constraints is pretty good IMO. It clearly marks what the code requires and what the ordering constraints are.

To say this more concisely : I'd rather see atomic<> functions in the language rather than atomic data types. Because it's really the operations that are "atomic" or not. But it looks like I won't get my way and we're all going to be smothered by an avalanche of broken thready code in the next 10 years.

1/26/2009

01-25-09 - Low Level Threading Junk Part 2.01

I'm going to update Part 2 with some fixes and I'd like to write a Part 3 about some actual lock-free madness, but those may have to wait a tiny bit. In the mean time I thought I would post some links.

These are some very good summaries, mainly by the MSDN guys :

MSDN - Lockless Programming Considerations for Xbox 360 and Microsoft Windows
MSDN - Concurrency What Every Dev Must Know About Multithreaded Apps
MSDN - Understanding Low-Lock Techniques in Multithreaded Apps
MSDN - Synchronization and Multiprocessor Issues (Windows)
Intel - Multiple Approaches to Multithreaded Applications

These articles have a good discussion of the memory models of various processors ; in particular Part 1 has a nice table that shows what each processor does (warning : there may be some slight wrongness in here about x86).

Memory Ordering in Modern Microprocessors, Part I
Memory Ordering in Modern Microprocessors, Part II

Here come a mess of links about memory models and barriers and what's going on. There's a lot of contradiction (!!) so be careful : For example - does LFENCE do anything at all on x86 for normal aligned integral reads? good question... btw SFENCE definitely does do nothing (I think?).

Bartosz Milewski - Publication Safety on Multicore
Hans Boehm - Why atomics have integrated ordering constraints
Bartosz Milewski - C Atomics and Memory Ordering
YALFAm (Yet Another Lock Free Approach, maybe) - comp.programming.threads Google Groups
Who ordered memory fences on an x86? - Bartosz Milewski's Programming Cafe
The Double-Checked Locking is Broken Declaration
Synchronization and the Java Memory Model
Re Memory fence instructions on x86
Re LFENCE instruction (was [rfc][patch 33] x86 optimise barriers)
Re Intel x86 memory model question 2
Re Intel x86 memory model question 1
Memory Barriers, Compiler Optimizations, etc. - comp.programming.threads Google Groups
LFENCE instruction (was [rfc][patch 33] x86 optimise barriers)
cbrumme's WebLog Memory Model

And now some links about some real mad shit. Do not try any of this at home. I've found two mad geniuses : Chris Thomasson ("the AppCore guy"), who has got some nutty x86 lock-free synchronization stuff that relies on details of x86 and does safe communication with zero fencing, and Dmitriy V'jukov ("the Relacy guy"), who has written a ton of nutty awesome stuff for Intel.

Wait free queue - comp.programming.threads Google Groups
Thin Lock Implementation - Bartosz Milewski's Programming Cafe
Paul E. McKenney Read-Copy Update (RCU)
Multithreaded Producer-Consumer The Easy Way
Larry Osterman's WebLog So you need a worker thread pool...
julian m bucknall Hazard Pointers (careful - wrong stuff in here)
Dr. Dobb's Lock-Free Queues July 1, 2008
Computer Laboratory - Cambridge Lock Free Library
CodeProject A simple Win32 readers/writers lock with reentrance. Free source code and programming help
Atomic Ptr Plus Project
AppCore A Portable High-Performance Thread Synchronization Library
Re Non-blocking queue for the 2 threads (AppCore)
refcount - mostly lock-free word-based atomic reference counting (AppCore)
Asymmetric rw_mutex with atomic-free fast-path for readers - Scalable Synchronization Algorithms Google Groups
Asymmetric Dekker Synchronization

There's also Intel's TBB :

Threading Building Blocks Home

It's a horrible name, it's not "Threading Building Blocks" at all. It's fucking OpenMP. It's not really a lib of little threading helpers, which would be handy, it's like fucking CUDA for CPUs.

Also just to repeat - I am in no way recommending that anybody go down this path. As I have said repeatedly - CRITICAL_SECTION is really fucking fast and smart and does exactly what you want. Just use it. If critical section locks are giving you problems it's because your architecture is wrong, not because the micro-routine is not fast enough.

01-25-09 - Low Level Threading Junk Part 2

Okay, we're starting to learn some primitive threading junk. So let's try to use it to do something useful.

Non-trivial statics in C++ are hideously non-thread-safe. That should be extremely obvious but if you don't see it, read here : The Old New Thing C++ scoped static initialization is not thread-safe, on purpose! . Obviously plain C static initialization of integral types is fine (BTW floats are NOT fine - beware using const floats in C++). Beware all things like this :


Object & Instance() { static Object s; return s; }

void func(int i)
{
    static Object once;
    once.DoStuff(i);
}

Not thread safe. For the most part, we should just not use them. It's very nice to do minimal work in CINIT. Much brouhaha has been made about the "thread safe singleton" (see all these) :

opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)
Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN
Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004
Double-checked locking - Wikipedia, the free encyclopedia

For my money this is a little silly because there's a really damn trivial solution to this. The main point of the old C++ initialize-on-demand Singleton pattern (like Instance() above) is to make it work during CINIT with complex initialization order dependencies. Now, CINIT we know is run entirely on one thread before any threads are made. (you are fucking nuts if you're making threads during CINIT). That means we know any Singletons that get made during CINIT don't have to worry about threading. That means you can use a trivial singleton pattern, and just make sure it gets used in CINIT :


static Object * s_instance = NULL; // okay cuz its a basic type

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        s_instance = new Object;
    }
    return s_instance;
}

static volatile Object * s_forceUse = GetSingleton();

The forceUse thing just makes sure our GetSingleton is called during cinit, so that we can be sure that we're all made by the time threads come around. The GetSingleton() that we have here is NOT thread safe, but it doesn't need to be ! (btw I assume that "CreateThread" acts as a MemoryBarrier (??))

Okay, that's nice and easy. What if you want a Singleton that doesn't usually get made at all, so you don't want to just make it during CINIT ? (you actually want it to get made on use). Okay, now we have to worry about thread-safing the construction.

The easiest way is you know your Singleton won't get used during CINIT. In that case you can just use Critical Section :


static CCriticalSection s_crit;     // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * s_instance = NULL;

// broken version :

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        s_crit.Lock();

        s_instance = new Object;
    
        s_crit.Unlock();
    }
    return s_instance;
}

Okay, that was easy, but it's horribly broken. First of all, there's no guarantee that Object is only made once. The thread can switch while the instruction pointer is between the first if() and the critsec lock. If that happens, some other thread can get in and make s_instance, and then when we come back to execute we run through and make it again. (If you like, you could say we put the critsec in the wrong place - we could fix it by moving the critsec out of the if). Even aside from that, the line that assigns s_instance is all wrong, because the pointer to s_instance is not necessarily being written atomically, and it might be written before the stuff inside Object is written. What did we learn in Part 1?

// good version :

static CCriticalSection s_crit;     // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance ) // must be memory_order_consume
    {
        s_crit.Lock();

        if ( ! s_instance )
        {
            Object * pNew = new Object;
        
            InterlockedExchangeRelease( &s_instance , pNew );
        }   
    
        s_crit.Unlock();
    }
    return s_instance;
}

This is a "double-checked lock" that works. The purpose of the double-check is to avoid taking the critsec when we can avoid doing so. If instance is NULL we take the critsec, then we have to check again to make sure it's still null, then we rock on and make sure that the Object's memory is flushed before s_instance is set, by using a "Release" memory barrier. Also using Interlocked ensures the pointer is written atomically.

ADDENDUM : I need to add some more notes about this. See comments for now.
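For reference, here's the same double-checked pattern sketched with C++11 atomics instead of the Windows calls above (this is a portable sketch, not the original code; std::atomic and std::mutex are the assumptions here, and the toy Object is made up) :

```cpp
#include <atomic>
#include <mutex>
#include <cstddef>
#include <cassert>

struct Object { int x; Object() : x(7) {} };

static std::mutex s_crit;
static std::atomic<Object *> s_instance(NULL);

Object * GetSingleton()
{
    // first check : no lock taken on the fast path
    Object * p = s_instance.load(std::memory_order_acquire);
    if ( ! p )
    {
        std::lock_guard<std::mutex> lock(s_crit);

        // second check : someone may have beaten us to the lock
        p = s_instance.load(std::memory_order_relaxed);
        if ( ! p )
        {
            p = new Object;
            // release : Object's memory is visible before the pointer is
            s_instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The acquire/release pair plays the role of the "Release" Interlocked write and the memory_order_consume read in the Windows version.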

Okay, that's all good, but it doesn't work if you now ever try to use this Singleton from CINIT. The problem is that s_crit might not be constructed yet. There's a pretty easy solution to that - just check if s_crit has been initialized, and if it hasn't, then don't use it. That works because we know CINIT is single threaded, so you can do something like :


class Object { } ;
static CRITICAL_SECTION s_crit = { 0 };
AT_STARTUP( InitializeCriticalSection(&s_crit); );
static Object * s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        if ( s_crit.DebugInfo == 0 ) // DebugInfo will always be nonzero for an initialized critsec
        {
            // must be in CINIT !
            s_instance = new Object;
        }
        else
        {
            EnterCriticalSection(&s_crit);

            if ( ! s_instance )
            {
                Object * pNew = new Object;
        
                InterlockedExchangeRelease( &s_instance , pNew );
            }   
    
            LeaveCriticalSection(&s_crit);
        }
    }
    return s_instance;
}

This actually works pretty dandy. Note that in CINIT you might take the upper or lower paths - you have no way of knowing if GetSingleton() will be called before or after the critsec is initialized. But that's fine, it works either way, by design. Note that we are crucially relying here on the fact that all non-trivial CINIT work is done after all the simple-type zeroing.

Okay, so that's all fine, but this last thing was pretty ugly. Wouldn't it be nice to have a critical section type of mutex object that can be statically initialized so that we don't have to worry about CINIT mumbo jumbo ?

Well, yes it would, and it's pretty trivial to make a standard simple Spin Lock thing :


typedef LONG SpinMutex; // initialize me with = 0

void SpinLock(SpinMutex volatile * mut)
{
    // 0 is unlocked, 1 is locked
    COMPILER_ASSERT( sizeof(SpinMutex) == sizeof(LONG) );
    int spins = 0;
    while( InterlockedCompareExchange((LONG *)mut,1,0) == 1 ) // Acquire
    {
        ++spins;
        if ( spins > CB_SPIN_COUNT )
        {
            Sleep(0);
        }
    }
    // *mut should now be 1 
}
    
void SpinUnlock(SpinMutex volatile * mut)
{
    // *mut should now be 1
    InterlockedDecrement((LONG *)mut); // Release
    // *mut should now be 0
    // if there was a higher priority thread stalled on us, wake it up !    
}

Note that SpinLock will deadlock you if you try to recurse. Okay, now because this thing is static initializable we can use it to make a Singleton and be safe for CINIT :

static SpinMutex s_mutex = 0;
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        SpinLock(&s_mutex);

        if ( ! s_instance )
        {
            Object * pNew = new Object;
        
            InterlockedExchangeRelease( &s_instance , pNew );
        }   
    
        SpinUnlock(&s_mutex);
    }
    return s_instance;
}

And that works too.

Now, people get tempted to use these simple SpinLocks all over. *DON'T DO IT*. It looks like it should be faster than CRITICAL_SECTION but in fact it's way way slower in bad cases.

First of all, what is a CRITICAL_SECTION ? See : Matt Pietrek - Critical Sections Under Windows .

It starts out as a simple spin lock like the above. It does the same kind of thing to busy-spin the processor first. But then it doesn't just Sleep. This kind of spin-and-sleep is a really really bad thing to do in heavy threading scenarios. If you have lots of threads contending over one resource, especially with very different execution patterns, the spin-and-sleep can essentially serialize them (or worse). They can get stuck in loops and fail to make any progress.

Once CRITICAL_SECTION sees contention, it creates a kernel Event to wait on, and puts the thread into an alertable wait. The Windows scheduler has lots of good mojo for dealing with events and threads in alertable waits - it's the best way to do thread sleeping and wake-ups generally. For example, it has mojo to deal with the bad cases of heavy contention by doing things like randomly choosing one of the threads that is waiting on a critsec to wake up, rather than just waking up the next one (this prevents a few runaway threads from killing the system).

One important note : you should almost always use "InitializeCriticalSectionAndSpinCount" these days to give your crit secs a spinner. More about why in a moment.

About performance : (from one of the MSDN articles) :


    * MemoryBarrier was measured as taking 20-90 cycles.
    * InterlockedIncrement was measured as taking 36-90 cycles.
    * Acquiring or releasing a critical section was measured as taking 40-100 cycles.

which clearly shows that the idea that a critical section is "too slow" is nonsense. I've said this already but let me emphasize :

    Most of the time (no contention) a Crit Sec *is* just an InterlockedIncrement
        (so how could you beat it by using InterlockedIncrement instead?)

    When there is contention and the Crit Sec does something more serious (use an Event) it's slow
        but that is exactly what you want !!

In fact, the big win from "lock free" is not that InterlockedIncrement or whatever is so much faster - it's that you can sometimes do just one interlocked op, unlike crit sec which requires two (one for Enter and one for Leave).
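To make that count concrete with a trivial shared counter (a C++11 sketch; the names here are made up, and the difference only matters on the no-contention fast path) :

```cpp
#include <atomic>
#include <mutex>
#include <cassert>

std::atomic<long> g_atomicCount(0);

long g_lockedCount = 0;
std::mutex g_countMutex;

// "lock free" : the whole operation is one atomic RMW op
void IncrementAtomic()
{
    g_atomicCount.fetch_add(1, std::memory_order_relaxed);
}

// critsec style : even with zero contention this pays for two
// synchronizing ops - one in the lock, one in the unlock
void IncrementLocked()
{
    std::lock_guard<std::mutex> lock(g_countMutex);
    ++g_lockedCount;
}
```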

An interesting alternative is this QLock thing by Vladimir Kliatchko : Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007 (QLock) ; it's quite a bit like the MS critsec, in that he uses a simple interlocked op to check for contention, and if it's not free then he makes an Event and goes into a wait state and all that. The code is there on Dr. Dobbs but you have to do the thing where you hit the "Download" in the top bar and then navigate around to try to find it. it's here in MUTEX.LST .


Ok, now a bit more about InitializeCriticalSectionAndSpinCount. The reason you want a spin is because basically every new machine these days is "multiprocessor" (multicore). That's a big difference from threading on a single core.

If you're threading on a single core - memory will never change underneath you while you are executing. You can get swapped out and memory can change and you can get swapped in, so it looks like memory changed underneath you, but it doesn't happen in real time. With multiple cores memory can be touched by some other core that is running along at full speed.

Often this doesn't affect the way you write threading code, but it does affect performance issues. It means you can have contention on a way finer scale. In a normal OS single proc environment, you get large time slices so other threads aren't frequently randomly poking at you. With real multicore the other guy can be poking at you all the time. A good range is to spin something like 1000-5000 times; 1000 spins is on the order of a microsecond.

What the spin count does is let you stay in the same thread and avoid an OS thread switch if some other processor is holding the lock. Note that if it's a thread on the same processor - spinning is totally pointless (in fact it's harmful).
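The spirit of the spin-count idea can be sketched portably with C++11 atomics (this is NOT how CRITICAL_SECTION is implemented - the real thing falls back to a kernel Event, not a yield - and the spin count constant here is made up) :

```cpp
#include <atomic>
#include <thread>
#include <cassert>

// Spin a bounded number of times hoping the owner (running on another
// core) releases quickly, then fall back to yielding to the scheduler.
class SpinThenYieldLock
{
    std::atomic<int> m_locked; // 0 = unlocked, 1 = locked
public:
    SpinThenYieldLock() : m_locked(0) {}

    void Lock()
    {
        const int kSpinCount = 4000; // assumed value; tune per machine
        int spins = 0;
        int expected = 0;
        while ( ! m_locked.compare_exchange_weak(expected, 1, std::memory_order_acquire) )
        {
            expected = 0; // CAS failure wrote the current value into expected
            if ( ++spins > kSpinCount )
                std::this_thread::yield(); // stop burning the timeslice
        }
    }

    void Unlock()
    {
        m_locked.store(0, std::memory_order_release);
    }
};
```

Again, this is for illustration only - the real CRITICAL_SECTION waits on a kernel Event after the spin, which is far better under heavy contention.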

1/25/2009

01-25-09 - Low Level Threading Junk Part 1

Okay, this is definitely one of those posts where I'm no expert by a long shot so I'll probably write some things that are wrong and you should correct me. By "low level" I mean directly accessing shared variables and worrying about what's going on with the memory, as opposed to just using the language/OS constructs for safe sharing.

Let me say up front that writing lots of low-level thread-sharing code is a very very bad idea and should not be done in 99% of cases. Just use CriticalSection, and once in a while use Interlocked, and don't worry about the tiny inefficiency; if you do things right they won't be a speed hit. Trying to get things right in lots of low level threading code is a recipe for huge bugs.

I'm going to assume you know about race conditions and basic threading primitives and such. If you're hazy, this is a pretty good introduction : Concurrency What Every Dev Must Know About Multithreaded Apps . I'm also going to assume that you know how to do simple safe thread interaction stuff using Interlocked, but maybe you don't know exactly what's going on with that.

First of all, let me try to list the various issues we're dealing with :

1. Multiple threads running on one core (shared cache). This means you can get swapped in and out; this is actually the *easy* part but it's what most people think of as threading.

2. Multiple cores/processors running multiple threads, possibly with incoherent caches. We have #1 to deal with, but also now the memory views of various threads can be different. The memories sync up eventually, but it's almost like they're talking through a delayed communication channel.

3. CPU out-of-order execution : instruction reorder buffers and cache gather/reorder buffers. Instructions (even in ASM) may not execute in the order you wrote, and even if they do execute in that order, memory reads & writes may not happen in the order of the instructions (because of cache line straddle issues, write gather buffers, etc.)

4. Single CISC CPU instructions being broken into pieces (such as unaligned accesses). Single ASM instructions may not be single operations; things like "inc" become "load, add, store" and there's an opportunity for a thread to interleave in there. Even apparently atomic ops like just a "load" can become multiple instructions if the load is unaligned, and that creates a chance for another thread to poke in the gap.

5. The compiler/optimizer reordering operations. Obviously things don't necessarily happen in the order that you write them in your C program.

6. The compiler/optimizer caching values or eliminating ops

I think that's the meat of it. One thing that sort of mucks this all up is that x86 and MSVC >= 2005 are sort of special cases which are much simpler than most other compilers & platforms. Unfortunately most devs and authors are working with x86 and MSVC 2005+ which means they do lazy/incorrect things that happen to work in that case. Also I'm going to be talking about C++ but there are actually much better memory model controls now in Java and C#. I'm going to try to describe things that are safe in general, not just safe on x86/Windows, but I will use the x86/Windows functions as an example.

Almost every single page I've read on this stuff gets it wrong. Even the experts. I always see stuff like this , where they implement a lock-free fancy doohicky, and then come back later and admit that oh, it doesn't actually work. For example this Julian Bucknall guy has a lot of long articles about lock free stuff, and then every 3rd article he comes back and goes "oh BTW the last article was wrong, oops". BTW never try to use any of the lock free stuff from a place like "codeproject.com" or anybody's random blog.

I've read a lot of stuff like :

Unfortunately, Matt's answer features what's called double-checked locking which isn't supported by the C/C++ memory model.

To my mind, that's a really silly thing to say. Basically C/C++ doesn't *have* a memory model. Modern C# and Java *do*, which means that the language natively supports the ideas of memory access ordering and such. With C/C++ you basically have zero guarantees of anything. That means you have to do everything manually. But when you do things manually you can of course do whatever you want. For example "double checked locking" is just fine in C, but you have to manually control your memory model in the right way. (I really don't even like the term "memory model" that much; it's really an "execution model" because it includes things like the concept of what's atomic and what can be reordered).

Some things I'm going to talk about : how lock free is really like spin locking, how critical sections work, why you should spincount critical sections now, what kind of threading stuff you can do without critsecs, etc.

Something I am NOT going to talk about is the exact details of the memory model on x86/windows, because I don't think you should be writing code for a specific processor memory model. Try to write code that will always work. x86/windows has strong constraints (stores are not reordered past stores, etc. etc. but forget you know that and don't rely on it).


Let's look at a simple example and see if we can get it right.

Thread A is trying to pass some work to thread B. Thread B sits and waits for the work then does it. Our stupid example looks like :


// global variables :
MyStruct * g_work = NULL;


Thread A :

int main(int argc, char ** argv)
{
    g_work = new MyStruct( argc, argv );

    ....
}


void ThreadB()
{

    while( g_work == NULL )
    {
    }

    g_work->DoStuff();
}

Okay, now let's go through it and fix it.

First of all, this line should give you the heebee jeebees :

    g_work = new MyStruct( argc, argv );

It's doing a bunch of work and assigning to a shared variable. There are no guarantees about what order that gets written to memory, so g_work could be assigned before the struct is set up, then ThreadB could start poking into it while I'm still constructing it. We want to release a fully-finished object to g_work. We can start trying to fix it by doing :

1.  MyStruct * temp = new MyStruct( argc, argv );
2.  g_work = temp;

That's good, but again you cannot assume anything about the memory model in C or the order of operations. In particular, we need to make sure that the writes to memory done by line 1 are actually finished before line 2 executes.

In Windows we do that by calling MemoryBarrier() :

    MyStruct * temp = new MyStruct( argc, argv );
    MemoryBarrier();
    g_work = temp;

MemoryBarrier is an intrinsic in MSVC ; it actually does two things. 1. It emits an instruction that causes the processor to force a sync point (this itself does two things : 1A. flushes caches and write gather buffers, and 1B. puts a fence in the reorder buffer so the processor can't speculate ahead). 2. The MemoryBarrier intrinsic also acts as a compiler optimizer barrier - so that the MSVC compiler won't move work from before MemoryBarrier to after it.

MemoryBarrier is a full memory fence, it creates an ordering point. In general if you just write memory operations :

    A
    B
    C

you can't say what order they actually happen in. If another thread is watching that spot it might see C,A,B or whatever. With MemoryBarrier :

    A
    B
    MemoryBarrier
    C

You get an order constraint : C is always after {A,B} , so the actual order might be ABC or BAC.

Another digression about the compiler optimizer fence : in windows you can also control just the compiler optimization with _ReadWriteBarrier (and _ReadBarrier and _WriteBarrier). This doesn't generate a memory fence to the processor, it's just a command to the optimizer to not move memory reads or writes across a specific line. I haven't seen a case where I would actually want to use this without also generating a memory barrier (??). Another thing I'm not sure about - it seems that if you manually output a fence instruction with __asm, the compiler automatically treats that as a ReadWriteBarrier (??).

Alright, so we're getting close, we've made a work struct and forced it to flush out before becoming visible to the other thread :

    MyStruct * temp = new MyStruct( argc, argv );
    MemoryBarrier();
    g_work = temp; <-

What about this last line? It looks innocuous, but it holds many traps. Assignment is atomic - but only if g_work is a 32 bit pointer on 32-bit x86, or a 64 bit pointer on 64-bit x86. Also, since g_work is just a variable it could get optimized out or deferred or just stored in local cache and not flushed out to the bus, etc.

One thing we can do is use "volatile". I hesitate to even talk about volatile because it's not well defined by C and it means different things depending on platform and compiler. (In particular, MS has added lots of threading usefulness to volatile, but nobody else does what they do, so don't use it!). What we want "volatile" for here is to force the compiler to actually generate a memory store for g_work. To my mind I like to think that volatile means "don't put me in registers - always read or write memory". (again on x86 volatile means extra things, for example volatile memory accesses won't get reordered, but don't rely on that!). Note you might also have to make sure g_work is aligned unless you are sure the compiler is doing that.

One thing to be careful with about volatile is how you put it on a pointer. Remember to read pointer declarations from right to left in C:

    volatile char * var;

    // this is a non-volatile pointer to a volatile char !! probably not what you meant

    char * volatile var;

    // this is a volatile pointer to chars - probably what you meant
    //  (though of course you have to make sure the char memory accesses are synchronized right)

Note that volatile is a pretty big performance hit. I actually think most of the time you should just not use "volatile" at all, because it's too variable in its meaning, and instead you should manually specify the operations that need to be sync'ed :

On Windows there's a clean way to do this :


    MyStruct * temp = new MyStruct( argc, argv );
    InterlockedExchangePointer( &g_work, temp );

The Interlocked functions are guaranteed to be atomic. Now we don't have to just hope that the code we wrote actually translated into an atomic op. The Interlocked functions also automatically generate memory barriers and optimizer barriers (!! NOTE : only true on Windows, NOT true on Xenon !!). Thus the InterlockedExchangePointer forces the writes that construct MyStruct to finish before g_work is set.

Let me just briefly mention that this full MemoryBarrier is *very* expensive and you can get away with less. In particular, something you will see is "Acquire" and "Release". The heuristic rule of thumb is that you use "Acquire" to read shared conditions and "Release" to write them. More formally, "Acquire" is a starting memory barrier - it means any memory op done after the Acquire will not move before it (but ones done before can move after). "Release" is a finishing memory barrier - memory ops done before it will not move after (but ones done after can be done before).

So if you have :

    A
    B
    C - Release
    D

The actual order could be {A B C D} or {B A C D} or {A B D C} but never {A C B D}. Obviously Acquire and Release are slightly weaker constraints than a full barrier, so they give the processor more room to wiggle, which is good for it.

So lets do another example of this simple thread passing (stolen from comp.programming.threads) :


int a;
int b;
LONG valid = 0;
CRITICAL_SECTION mtx;

void function1 () { // called from many threads

  EnterCriticalSection(&mtx);
  if (! valid) {
    a = something;
    b = something;
    InterlockedExchangeAddRelease(&valid, 1);
    // writes to memory 'a' and 'b' will be done before valid is set to 1
  }
  do_other_stuff();
  LeaveCriticalSection(&mtx);

}

int function2 () { // called from many threads

  if (InterlockedExchangeAddAcquire(&valid, 0)) {
    return a + b;
  } else {
    return 0;
  }

} 

int function3 () { // broken

  if ( valid ) {
    return a + b;
  } else {
    return 0;
  }

} 

Now it's easy to fall into a trap of thinking that because we did the "Release" that function3 is okay. I mean, by the time function3 sees valid get set to 1, 'a' and 'b' will already be set, so function 3 is right, okay? Well, sort of. That would be true *if* function 3 was in assembly so the compiler couldn't reorder anything, and if the chip couldn't reorder memory ops (or if we rely on the x86 constraint of read ordering). You should know the actual execution of function 3 could go something like :


fetch a
fetch b
add b to a
test valid
set conditional a to 0
return a

which is now reading a and b before 'valid'. Acquire stops this.
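The same publication pattern can be written portably with the Acquire and Release attached directly to the ops on 'valid' (a C++11 sketch of the pattern above, not the original Windows code; the function names are mine) :

```cpp
#include <atomic>
#include <cassert>

int a = 0;
int b = 0;
std::atomic<int> valid(0);

void publish()
{
    a = 1;
    b = 2;
    // Release : the writes to a and b cannot move after this store
    valid.store(1, std::memory_order_release);
}

int consume()
{
    // Acquire : the reads of a and b cannot move before this load,
    // which is exactly what function3 above was missing
    if ( valid.load(std::memory_order_acquire) )
        return a + b;
    return 0;
}
```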

Some good links on this basic memory barrier stuff : (read Kang Su in particular)

Kang Su's Blog volatile, acquirerelease, memory fences, and VC2005
Memory barrier - Wikipedia, the free encyclopedia
Acquire and Release Semantics
The Old New Thing Acquire and release sound like bass fishing terms, but they also apply to memory models
Memory ordering, barriers, acquire and release semantics niallryan.com
Memory barrier + volatile (C++) - comp.programming.threads Google Groups

BTW note that volatile does some magic goodness in VC >= 2005 , but *not* on Xenon even with that compiler version. In summary :


ReadBarrier/WriteBarrier - just prevents compiler reordering

MemoryBarrier() - CPU memory load/store barrier

Interlocked & volatile
    Interlocked functions on Windows automatically generate a compiler & a CPU barrier
    Interlocked on Xbox 360 does not generate a CPU barrier - you need to do it

special volatile thing in MSVC ?
    volatile automatically does the right thing on VC >= 2005
    not for XBox though

ADDENDUM : lol the new Dr. Dobbs has a good article by Herb called volatile vs volatile . It covers a lot of this same territory.

And some more links that are basically on the same topic : (the thing with the valid flag we've done here is called the "publication safety pattern")

Bartosz Milewski - Publication Safety on Multicore
Hans Boehm - Why atomics have integrated ordering constraints
Bartosz Milewski - C Atomics and Memory Ordering

old rants