10/11/2010

10-11-10 - Windows 1252 to ASCII best fit

I'd like to construct a Windows 1252 to ASCII (7 bit) best fit visual character mapping (eg. accented a -> a , vertical bar linedraw -> | , etc.). I couldn't find one anywhere ... okay, I made it myself :

const int c_windows1252_to_ascii[256] = 
{
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
 96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,
 35,102, 34, 46, 35, 35, 94, 35, 83, 60, 79, 35, 90, 35, 90, 35,
 35, 39, 39, 34, 34, 46, 45, 45,126, 84,115, 62,111, 35,122, 89,
 32, 33, 99, 35, 36, 89,124, 35, 35, 67, 97, 60, 35, 45, 82, 35,
 35, 35, 50, 51, 35, 35, 35, 46, 44, 49,111, 62, 35, 35, 35, 35,
 65, 65, 65, 65, 65, 65, 65, 67, 69, 69, 69, 69, 73, 73, 73, 73,
 68, 78, 79, 79, 79, 79, 79, 35, 79, 85, 85, 85, 85, 89, 35, 35,
 97, 97, 97, 97, 97, 97, 97, 99,101,101,101,101,105,105,105,105,
 35,110,111,111,111,111,111, 35,111,117,117,117,117,121, 35,121
};

( see this table as visible chars at cbloom.com )

This was generated by the Win32 best-fit conversion functions and it's not perfect. It gets the accented chars right, but it just gives up on the funny chars and puts in the "default" char, which I set to "#" (35); that's probably not the ideal choice.

So anyway, it would be better to have a hand-tweaked table, if you can find one.
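For what it's worth, applying such a table is trivial; a minimal sketch (the function name is mine, and the table is passed in so you can use the one above or a hand-tweaked one):

```cpp
#include <string>

// Map each byte of "in" through a 256-entry best-fit table
// (eg. the c_windows1252_to_ascii table above).
std::string MapThroughTable(const std::string & in, const int table[256])
{
    std::string out;
    out.reserve(in.size());
    for (size_t i = 0; i < in.size(); ++i)
        out += (char) table[ (unsigned char) in[i] ];
    return out;
}
```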

Some links I found that were not particularly helpful :

Index of /Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit
Character sets
Cast of Characters- ASCII, ANSI, UTF-8 and all that
ASCII Table 7-bit
ASCII Character Map
ANSI character set and equivalent Unicode and HTML characters
A tutorial on character code issues


Also, in further news on the "printf with wide chars considered harmful" front, I've discovered it can cause execution to break, not merely fail to convert the string well.

I get some wchar string from some perfectly reasonable source (such as MultiByteToWideChar or from a file name) and try to print it with printf %S (capital S for wide chars). The problem is at this point (output.c in the MSVC CRT) :


    e = _WCTOMB_S(&retval, L_buffer, _countof(L_buffer), *p++);
    if (e != 0 || retval == 0) {
        charsout = -1;
        break;
    }

because it's decided that the wchar is no good for some reason. wctomb_s will fail "if the conversion is not possible in the current locale". It winds up failing the whole printf (which causes it to return -1 and set errno). WTF, don't fail my entire printf just because you can't map one of the wchars. So fucked.

(I also have no clue why this particular wchar was failing to convert; it was like a squiggly f looking thing, it showed up just fine in the MSVC watch window, but for some reason the CRT locale shit didn't like it).
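If you want wide-to-narrow conversion that degrades gracefully instead, you can do it per character yourself; a sketch using the standard wctomb (the function name and the '?' substitute character are my choices):

```cpp
#include <stdlib.h>   // wctomb
#include <limits.h>   // MB_LEN_MAX
#include <string>

// Convert per character, and substitute for unmappable chars instead
// of failing the whole string the way the CRT's %S path does.
std::string WideToNarrowLossy(const wchar_t * ws)
{
    std::string out;
    wctomb(NULL, 0); // reset any shift state
    for ( ; *ws ; ++ws )
    {
        char buf[MB_LEN_MAX];
        int n = wctomb(buf, *ws);
        if ( n <= 0 )
            out += '?';         // can't map in current locale : substitute
        else
            out.append(buf, n);
    }
    return out;
}
```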

see :

_set_invalid_parameter_handler (CRT)
_setmbcp (CRT)
setlocale, _wsetlocale (CRT)

Anyway, my recommended best practice remains "don't use wide chars in printf" , unless you use autoprintf and let it convert them to console code page for you. (note that wstrings are converted automatically, but for raw wchar strings you have to call ToString() to make them convert)

If you use autoprintf and put this somewhere, it will handle the std string variants nicely :


START_CB
inline const char * autoprintf_StringToChar (const std::string & rhs)
{
    return rhs.c_str();
}

inline const String ToString( const std::wstring & rhs)
{
    return autoPrintfWChar(rhs.c_str());
}
END_CB

10/09/2010

10-09-10 - Game Controls

I'm playing a bit of Xbox 360 for the first time ever on the loaner box from work (side note : jebus it makes a hell of a lot of racket, the DVD is crazy loud, the fans are crazy loud, it's pretty intolerable; I also get disc unreadable errors semi-frequently which I guess is because the work box is screwed, but between this and the Red Ring problems, I can only conclude that this console is built like a piece of shit, just epically bad manufacturing quality).

Anyway, one thing that I find very depressing is that the games have almost uniformly bad basic controls. This is the most fundamental aspect of any game and you all should be ashamed that you don't do it right.

Perhaps the most frustrating and unforgivable of them is that so many games just drop inputs. This happens because the game or the player is in some kind of state where you are not allowed to do that action. The result is that they just lose the button press. eg. say you have a jump and an attack. You can't do them at the same time. I jump, and then right about the time that I'm landing I hit attack. That needs to work whether or not I hit the attack button right before I land or right after I land.

There are various ways to do this, but an easy one is to store a little history of all inputs that have been seen in the last few millis (200 ms or so is usually enough), and whether or not they have been acted upon. Each frame, you act not just on the new inputs that are seen that frame, but also on inputs seen recently that have not been acted upon.
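A minimal sketch of that scheme (names and constants are my own, purely illustrative):

```cpp
// Presses are recorded with a timestamp; an action consumes a press if
// it happened within the last BUFFER_MS, whether it arrived just before
// or just after the action became legal.

enum { NUM_BUTTONS = 16, BUFFER_MS = 200 };

struct InputBuffer
{
    int  pressTime[NUM_BUTTONS]; // time of most recent press, in ms
    bool consumed[NUM_BUTTONS];  // has that press been acted on?

    InputBuffer()
    {
        for (int i = 0; i < NUM_BUTTONS; ++i)
        {
            pressTime[i] = -1;
            consumed[i]  = true;
        }
    }

    // call from the input layer on every down edge :
    void OnPress(int button, int nowMs)
    {
        pressTime[button] = nowMs;
        consumed[button]  = false;
    }

    // call from gameplay when the action becomes possible
    // (eg. on landing from a jump) :
    bool TryConsume(int button, int nowMs)
    {
        if ( consumed[button] ) return false;                   // already used
        if ( nowMs - pressTime[button] > BUFFER_MS ) return false; // too stale
        consumed[button] = true;
        return true;
    }
};
```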

For doing things like "holding down A makes you sprint" , you need to be checking for button "is down" *not* the edge event. Unbelievably, some major games do this wrong, and the result is that they can miss the "is down" event, so I'm running around holding A and I'm not sprinting. (even if you do all the processing right as above, the player might press A down during the inventory screen or something like that when you are ignoring it).

For simultaneous button press stuff, you also need to allow some slop of course. Say you want to do something when both A and B are pressed. When you see an "A down" event you need to immediately start doing the action for A to avoid latency, but then if you see a "B down" within 50 millis or whatever, then you need to cancel the A action and start an "AB simultaneous press" action.
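Sketched out, that might look like this (again the names and the exact window are mine):

```cpp
// On "A down" we start the A action immediately to avoid latency; if
// "B down" arrives within CHORD_MS we cancel A and promote to the AB action.

enum Action { ACTION_NONE, ACTION_A, ACTION_AB };
enum { CHORD_MS = 50 };

struct ChordDetector
{
    Action current;
    int    aDownTime;

    ChordDetector() : current(ACTION_NONE), aDownTime(-1) {}

    void OnADown(int nowMs)
    {
        current   = ACTION_A;  // start A right away
        aDownTime = nowMs;
    }

    void OnBDown(int nowMs)
    {
        if ( current == ACTION_A && nowMs - aDownTime <= CHORD_MS )
            current = ACTION_AB; // cancel A, it was really a chord
    }
};
```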

One pretty bad trend is that lots of people are using something like Havok for physics and they run their player motion through it. They are often lazy about this and just get it working the same way one of the Havok demos did it, and the result is incredibly janky player motion, where you stutter, get caught on shit in the world, etc. There's a mistaken belief that more finely detailed collision geometry is better. Not so. As much as I don't really love the old Quake super-slidey style of motion that feels like the world is all greased up, it's better than this stuttering and getting caught on shit. The player's collision proxy should be rotationally symmetric around the vertical axis - you shouldn't be able to get stuck on junk by rotating. A vertical lozenge is probably the best player collision proxy.

A context-dependent generic "action" button is a perfectly reasonable thing to do and lots of games do it now, *however* it should not be one of your major buttons. eg. when you are not at an "action" location, the action button needs to do something pretty irrelevant to gameplay, eg. it should not be your "jump" or "attack" button or something important. That is, just because you're at an action location, you shouldn't lose the ability to do something major that you could otherwise do.

Automatic lockons for combat and such are okay, but you need to provide a way to get in and out of that mode, or a button to hold down to override it or something. Specifically, you should never take over control of the player's movement or camera without their having some ability to override it.

It's well known that for 1st person games you need to provide an "invert look" option, but IMO it's just as important to provide inverts for 3rd person cameras too, not just up-down but also left-right. Everyone has a different idea of what the right way for a 3rd person camera to move is, so just expose it, it's fucking trivial, do it.

It's very depressing to me that devs don't spend time on this basic shit.

It basically comes down to an incorrect philosophy of design. The incorrect philosophy is :

we have given the player a way in which it is possible for them to make the character do what they want

the correct philosophy is :

the character should at all times do what the player obviously wants it to do

and any time you fail in that, you have a major deficiency that should be fixed. Any time you see someone press a button and the character doesn't do what they wanted, don't coach the player on how to do it right - fix it so that the game just does the right thing.


A couple other minor points :

You have to be careful about the keys that you use in your menu system and what they do in the game world. This is a general important UI issue (on mouse-based interfaces, you should be aware of where buttons on popups lie on top of buttons underneath them - the issue is if a user clicks the popup multiple times and it disappears they may accidentally click beneath it, so don't put a "delete all my work" button under the popup).

In games in particular, if you have A = okay and B = back in your menu system, then you should make sure that if the user is hammering on "B" and comes out of the menu system and applies to the game world, that it won't fuck them.

Context-sensitive and tap/hold buttons and modifiers (eg. hold R1 changes the action of "A") are all okay for overloading, but *modal* controls are never okay. That is, if I press R1 it changes my mode and then my buttons do different things. Modal controls are absolutely awful because they remove the player's ability to act from muscle memory. It means I can't just hammer on "A" in a panic and get the expected result, I might wind up doing the wrong thing and going "fuck I'm in the wrong mode". Even worse are modal controls where the mode is not displayed clearly on screen in a reliable place because the game is trying to do an "immersive" HUD-less interface.

I imagine most people would agree that modal controls are terrible, but in fact many games do them. For example any time you press a button to go into a selection wheel for spell casting or dialog or something like that it's a modal control (a better way to do this is hold the button down to bring up the wheel, but really selection wheels are pretty fucking awful in general).

That is, for each action in the game there should be a unique physical player motion. eg. "fireball" should be hold L1 , dpad up, release. Actions should never move around on the wheel, they should be in a reliable place, and there should never be any conditionals in the action sequence to do a certain action.

Another type of modality is when certain actions are not allowed due to some state of the world. eg. maybe you're near friendlies so your attacks are disabled, or you're in a cut scene so you can't save. Again most games get this wrong and don't correctly acknowledge that they have created hidden modes that affect input. The modes need to be clearly displayed - often there's no reliable way to tell whether you are in a certain mode or not; for example a cut scene needs to be clearly separated from normal gameplay (eg. by bringing in vignetting or letterboxing or something) if it changes my ability to use controls. If I can't do my attacks or sprint or whatever, I should at least get some acknowledgement from the game that it received my input, and it's just not doing it.

10/08/2010

10-08-10 - Optimal Baseline JPEG

One of the things we are missing is a really good modern JPEG encoder/decoder. I mentioned most of this in the WebP post, but I thought it was important enough to repeat. This would be a great project if someone wants to do it; I'd like to, I think it's actually important, not just as a fair comparison between modern coders like x264 and good old JPEG, but also because it would actually be useful to people who care about JPEG images. (eg. a common use case is you have some old jpeg and you want to decode it as well as possible.)

Using normal JPEG code streams, but trying to make the encoder & decoder as good as possible, you should do something like :

Encoder :

  • RDO based structure; eg. encoder is given lambda and finds optimal R/D point. Unfortunately this has to be iterative because of huffman codes, decisions in one pass affect the huffman codes for the next pass.

  • A good perceptual metric to target. Maybe SSIM or x264's funny SATD activity thing, or something else.

  • Trellis quantization; the JPEG-huff code block structure lends itself to trellis state optimization pretty directly.

  • Better chroma subsample (aware of the up-filter).

  • Quant matrix optimization for perceptual metric.

Decoder :

  • Deblocking filter, or maybe the "Unblock" histogram non-filter approach or some combination.

  • Luma-aided chroma upsample

  • Expectation-in-bucket instead of center-of-bucket dequantization.

  • Noise reinjection , perhaps predicting where some of the zeros in the DCT should in fact be small non-zeros.

  • Shape-aware deringing ; similar to camera denoisers, there's a lot of work on this in the literature.
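The expectation-in-bucket idea above can be sketched in a few lines (the 0.2 bias here is a typical ballpark for roughly-Laplacian AC coefficients, not a tuned value; ideally you'd fit it per image or per subband):

```cpp
// A quantized coefficient i covers the bucket [ (i-0.5)*q , (i+0.5)*q ) ;
// naive dequant reconstructs at the bucket center i*q, but the coefficient
// distribution is peaked at zero, so the expected value within the bucket
// is biased toward zero.
float DequantExpectation(int i, float q, float bias)
{
    if ( i == 0 ) return 0.0f;
    int   absi = ( i < 0 ) ? -i : i;
    float mag  = ( (float)absi - bias ) * q; // pull the magnitude toward zero
    return ( i < 0 ) ? -mag : mag;
}
```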

10/07/2010

10-07-10 - Portable CRT God Dammit

What a god damn disaster. How is it that none of us have made our own portable wrapper for the CRT? I want a standard interface to the functions that do exist, and a standard way to ask for "does this exist", eg. something like :

#if fseek64_exists

fseek64(fp,off64,set);

#else

int32 off32 = check_value_cast_throw<int32>(off64);
fseek(fp,off32,set);

#endif

Mostly it should just be a wrapper that passes through to the underlying CRT, but for some things that are platform independent (eg. string stuff, sprintfs, etc.) it could just be its own implementation.

I see that Sean has started a portable types header called sophist that gives you sized types (int16 etc) and some #defines to check to get info about your platform. That's a good start.

For speed work you'd like some more things like "size of register" and something like "intr" (an int the size of a register) (one big issue here is whether the 64 bit type fits in a register or not). Also things like "can type be used unaligned".

Obviously C99 would help a lot, but even it wouldn't be everything. You want the stuff above that tells you a bit more about your platform and exposes low-level ops a bit. You also want stuff that's at the "POSIX" library level as opposed to just the CRT, eg. dir ops & renames and truncate and chmod and all that kind of stuff.

Every time I do portability work I think "god damn I wish I just made my own portability library" but instead I don't do it and just hack enough to make the current project work. If I had just done it the clean way from the beginning I would have saved a lot of work and been happier and made something that was useful to other people. And.. I'm just doing it the hacky way yet again.

(actually Boost addresses a lot of this but is just sick over-complex and inconsistent in quality; Boost.Thread looks pretty good and has Win32 condition variables, for example). I also just randomly found this ptypes lib which is pretty good for Win32 vs POSIX implementations of threading stuff.

10/02/2010

10-02-10 - WebP

Well, since we've done this in comments and emails and now some other people have gone over it, I'll go ahead and add my official take too.

Basically from what I have seen of WebP it's a joke. It may or may not be better than JPEG. We don't really know yet. The people who have done the test methodology obviously don't have image compression background.

(ADD : okay, maybe "it's a joke" is a bit harsh. The material that I linked to ostensibly showing its superiority was in fact a joke. A bad joke. But the format itself is sort of okay. I like WebP-lossless much better than WebP-lossy).

If you would like to learn how to present a lossy image compressor's performance, it should be something like these :

Lossy Image Compression Results - Image Compression Benchmark
S3TC and FXT1 texture compression
H.264/AVC intra coding and JPEG 2000 comparison

eg. you need to work on a good corpus of source images, you need to study various bit rates, you need to use perceptual quality metrics, etc. Unfortunately there is not a standardized way to do this, so you have to present a bunch of things (I suggest MS-SSIM-SCIELAB but that is nonstandard).

Furthermore, the question "is it better than JPEG" is the wrong question. Of course you can make an image format that's better than JPEG. JPEG is 20-30 years old. The question is : is it better than other lossy image formats we could make. It's like if I published a new sort algorithm and showed how much better it was than bubblesort. Mkay. How does it do vs things that are actually state of the art? DLI ? ADCTC ? Why should we prefer your image compressor that beats JPEG over any of the other ones that do? You need to show some data points for software complexity, speed, and memory use.

As for the VP8 format itself, I suspect it is slightly better than JPEG, but this is a little more subtle than people think. So far as I can tell the people in the Google study were using a JPEG with perceptual quantization matrices and then measuring PSNR. That's a big "image compression 101" mistake. The thing about JPEG is that it is actually very well tuned to the human visual system (*1); that tuning of course actually hurts PSNR. So it's very easy to beat JPEG in terms of PSNR/RMSE but in fact make output that looks worse. (this is the case with JPEG-XR / HD-PHOTO for example, and sometimes with JPEG2000 ). At the moment the VP8 codec is not visually tuned, but some day it could be, and when it eventually is, I'm sure it could beat JPEG.

That's the advantage of VP8 over JPEG - there's a decent amount of flexibility in the code stream, which means you can make an optimizing encoder that targets perceptual metrics. This is also what makes x264 so good; I don't think Dark Shikari actually realizes this, but the really great thing about the predictors in the H264 I frames is not that they help quality inherently, it's that they give you flexibility in the encoder. That is, for example, if you are targetting RMSE and you don't do trellis quantization, then predictors are not a very big win at all. They only become a big win when you let your encoder do RDO and start making decisions about throwing away coefficients and variable quantization, because then the predictors give you different residual shapes, which give you different types of error after transform and quantization. That is, it lets the encoder choose what the error looks like, and if your encoder knows what kinds of errors look better, that is very strong. (it's also good just when targetting RMSE if you do RDO, because it lets the encoder choose residual shapes which are easier to code in an R/D sense with your particular transform/backend coder).

My first question when somebody says they can beat JPEG is "did you try the trivial improvements to JPEG first?". First of all, even with the normal JPEG code stream you can do a better encoder. You can do quantization matrix optimization (DCTune), you can do "trellis quantization" (thresholding output coefficients to improve R/D), you can sample chroma in various ways. With the standard code stream, in the decoder you can do things like deblocking filters and luma-aided chroma upsample. You should of course also use a good quality JPEG Encoder such as "JPEG Wizard" and a lossless JPEG compressor ( also here ). (PAQ8PX, Stuffit 14, and PackJPG all work by decoding the JPEG then re-encoding it with a new entropy encoder, so they are equivalent to replacing the JPEG entropy coder with a modern one).

(BTW this is sort of off topic, but note that the above "good JPEG" is still lagging behind what a modern JPEG would be like. Modern JPEG would have a new context/arithmetic entropy coder, an RDO bit allocation, perceptual quality metric, per-block variable quantization, optional 16x16 blocks (and maybe 16x8,8x16), maybe a per-image color matrix, an in-loop deblocker, perhaps a deringing filter. You might want a tiny bit more encoder choice, so maybe a few prediction modes or something else (maybe an alternative transform to choose, like a 45 degree rotated directional DCT or something, you could do per-region quantization matrices, etc).)

BTW I'd like to see people stop showing Luma-only SSIM results for images that were compressed in color. If you are going to show only luma SSIM results, then you need to compress the images as grayscale. The various image formats do not treat color the same way and do not allocate bits the same way, so you are basically favoring the algorithms that give less bits to chroma when you show Y results for color image compressions.

In terms of the web, it makes a lot more sense to me to use a lossless recompressor that doesn't decode the JPEG and re-encode it. That causes pointless damage to the pixels. Better to leave the DCT coefficients alone, maybe threshold a few to zero, recompress with a new entropy coder, and then when the client receives it, turn it back into regular JPEG. That way people get to still work with JPEGs that they know and love.

This just smells all over of an ill-conceived pointless idea which frankly is getting a lot more attention than it deserves just because it has the Google name on it. One thing we don't need is more pointless image formats which are neither feature rich nor big improvements in quality which make users say "meh". JPEG2000 and HD-Photo have already fucked that up and created yet more of a Babel of file types.

(footnote *1 : actually something that needs to be done is JPEG needs to be re-tuned for modern viewing conditions; when it was tweaked we were on CRT's at much lower res, now we're on LCD's with much smaller pixels, they need to do all that threshold of detection testing again and make a new quantization matrix. Also, the 8x8 block size is too small for modern image sizes, so we really should have 16x16 visual quantization coefficients).

10/01/2010

10-01-10 - Some Data Compression Corpora We Need Badly

If somebody wants a university project, these would be nice :

1. A lossless data compression corpus that is *broad* and also *representative*. That is, there are many types of data (probably 100-1000 files), some small, some large. Importantly the type of correlation structure in the data should be very diverse (eg. not just a ton of different English text files or executables). Too many of the corpora are simply too small, and even the ones that are reasonably large are too self-redundant, they wind up not containing a sample of a certain type of data that does occur in the wild.

Finally the thing that's really missing is there should be a weighting number assigned to each file such that they are given importance based on their chance of occurance in the wild. To get these numbers you could do a few different things - download every archive on thepiratebay and sample what's inside them (this gives you a sampling of the type of files people actually put in archives), or maybe put a snooper on the internet backbone and sample the total set of all data that flies on the internet. The point is that this sampling should be based on the actual frequency of various data types, not just an ad hoc composite.

2. An image set with human quality metrics. Somebody needs to take a big set of test images (32-100), munge them in various ways by running them through various compressors (as well as other ways of damaging them that aren't well known compressors), and then get actual human visual ratings on the damaged versions. Then provide all the damaged versions (or code to produce them) with the human ratings.

If we had a test set like that, we could tweak our algorithmic approximations of human quality rating (eg. SSIM etc) until they reproduce what the actual humans say. This is not a test set for image compressors, it's a test set for image quality metric training, which is what we really need to take image compressors to the next level.

9/30/2010

09-30-10 - Coder News

Ignacio wrote a nice article on GPU radiosity via hemicubes . I always wanted to try that, I'm jealous.

I suspect that you could be faster & higher quality now by doing GPU ray tracing instead of render-to-textures. The big speed advantage of GPU ray tracing would come from the fact that you can make the rays to sample into your lightmaps for lots of lightmap texels and run them all in one big batch. Quality comes from the fact that you can use all the fancy raytracing techniques for sampling, monte carlo methods, etc. and it's rather easier to do variable-resolution sampling (eg. start with 50 rays per lightmap texel and add more if there's high variance). In fact, you could set up all the rays to sample into your lightmap texels for your whole world, and then run N bounces by just firing the whole gigantic batch N times. Of course the disadvantage to this is you have to implement your whole renderer twice, once as a raytracer and once for normal rendering; avoiding that is the whole point of using the hemicube render to texture method.

Was talking to Dave and randomly mentioned that I thought the new Rvalue references were not that interesting because you can basically do everything they give you using RVO and swaps. But then I did some more reading and have changed my mind. Rvalue references are awesome!

Want Speed? Pass by Value. - C++Next (Dave Abrahams series on value semantics)
C++ Rvalue References Explained (good intro)
A Brief Introduction to Rvalue References (Howard E. Hinnant, Bjarne Stroustrup, and Bronek Kozicki)
Rvalue References C++0x Features in VC10, Part 2 - Visual C++ Team Blog - Site Home - MSDN Blogs
rvalue reference (the proposal document)
InformIT C++ Reference Guide The rvalue Reference Proposal, Part I

Basically it lets you write templates that know an object is a temporary, so you can mutate it or move from it without making a copy. I think one problem we sometime have as coders is that we think about our existing code and think "bah I don't need that" , but it's only because we have been so conditioned not to write code in that way because we don't have the right tools. C++0x makes it possible to do things in templates that you always wished you could do. That doesn't necessarily mean doing lots more complicated things, it means doing the little things and getting them exactly right. "auto" , "decltype" , "[[base_check]]", etc. all look pretty handy.
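A tiny example of the kind of thing this enables (the Buffer type here is mine, purely illustrative): the move constructor binds only to temporaries, so returning by value can steal storage instead of deep-copying.

```cpp
#include <cstring>
#include <cstddef>

struct Buffer
{
    char * data;
    size_t size;

    Buffer(size_t s) : data(new char[s]), size(s) {}
    ~Buffer() { delete [] data; }

    // copy : deep copy, expensive
    Buffer(const Buffer & rhs) : data(new char[rhs.size]), size(rhs.size)
    {
        memcpy(data, rhs.data, size);
    }

    // move : the compiler only binds this overload to rvalues
    // (temporaries), so it is safe to just steal the pointer
    Buffer(Buffer && rhs) : data(rhs.data), size(rhs.size)
    {
        rhs.data = 0;
        rhs.size = 0;
    }
};

Buffer MakeBuffer() { return Buffer(1024); } // returns an rvalue : moved, not copied
```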

9/28/2010

09-28-10 - Branchless LZ77 Decoder

It occurred to me that if you do an LZ77 where the literal/match flag is sent via a run length of literals, then a run of literals is the same as a match, but the source is the next few bytes of the compressed buffer, rather than some previous location in the decompressed buffer.

That is, you're decompressing from "comp" buffer into "dest" buffer. A "match" is just a copy from the "dest" buffer, and literals are just at copy from the "comp" buffer.

So, let's say we do a byte-wise LZ77 , use one bit to flag literal or match, then 7 bits for the length. Our branchless decoder is something like :


{
    U8 control = *comp++;
    int source = control >> 7;          // 1 = literal run , 0 = match
    int length = (control & 127) + 1;
    U8 * lit_ptr = comp;
    U8 * mat_ptr = dest - *((U16 *)comp);
    U8 * copy_from_ptr = select( lit_ptr, mat_ptr, source );
    memcpy( dest, copy_from_ptr, length );
    dest += length;
    // literals consume "length" bytes of comp , matches consume the 2 offset bytes :
    comp += 2 + ( (length - 2) & -source );
}

Where "select(a,b,c)" = c ? a : b ; or something like that. (sometimes better implemented with a negative and; on the PC you might use cmov, on the SPU you might use sel, etc.)
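For example, one branchless way to build that select with plain integer ops (renamed select_ptr here just to avoid colliding with the POSIX select; matches the c ? a : b convention above):

```cpp
#include <stdint.h>

// Build an all-ones / all-zeros mask from the condition and blend :
static inline const uint8_t * select_ptr( const uint8_t * a, const uint8_t * b, int c )
{
    uintptr_t mask = (uintptr_t)0 - (uintptr_t)( c != 0 ); // c!=0 -> ~0 , c==0 -> 0
    return (const uint8_t *)( ( (uintptr_t)a & mask ) | ( (uintptr_t)b & ~mask ) );
}
```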

While this should be very fast, compression will not be awesome because the split between literals and matches is not ideal, 7 bits of length is a lot for literals but not enough for matches, and offsets are always 16 bits, which is too little. We can do a slightly better version using a 256 entry lookup table for the control :


{
    U8 control = *comp++;
    int source = source_table[control];  // 1 = literal run , 0 = match
    int length = length_table[control];
    U8 * lit_ptr = comp;
    U8 * mat_ptr = dest - *((U16 *)comp);
    U8 * copy_from_ptr = select( lit_ptr, mat_ptr, source );
    memcpy( dest, copy_from_ptr, length );
    dest += length;
    comp += 2 + ( (length - 2) & -source );
}

for example with the table you could let the match lengths be larger and sparse. But it would probably be better to just have a branch that reads more bytes for long lengths.

Adding things like optionally larger offsets starts to make the branchless code complex enough that eating one branch is better. If you're willing to do a lot of variable shifts it's certainly possible; for example you could grab 1 control byte and look it up in various tables. The tables tell you some # of bits for match length, some # for match offset, and some # for literal run length (they add up to a multiple of 8 and use some portion of the control byte as well). Unfortunately variable shifting is untenably slow on many important platforms.

BTW one useful trick for reducing branches in your LZ decoder is to put the EOF check inside some rare case, rather than as your primary loop condition, and change your primary loop to be an unconditional branch. On PC's this doesn't change much, but on some architectures an unconditional branch is much cheaper than a conditional one, even if it's predictable. That is, instead of :


while ( ! eof )
{
   .. do one decode step ..
   if ( mode 1 )
     ..
   if ( mode 2 )
    ...
}

You do :

for(;;)
{
   .. do one decode step ..
   if ( mode 1 )
     ..
   if ( mode 2 )
   {
     if ( rare case )
     {
        if ( eof )
          break;
     }
   }
}

Also, obviously you don't actually use "memcpy" , but whatever you use for the copy has an internal branch. And of course we have turned our branch into a tight data dependency. On some platforms that's not much better than a branch, but on many it is much better. Unfortunately unrolling doesn't help the data dependency much because of the fact that LZ can copy from its own output, so you have to wait for the copy to be done before you can start the next one (there's also an inherent LHS here, though that stall is the least of our worries).

9/21/2010

09-21-10 - Waiting on Thread Events Part 2

So we've seen the problem, let's look at some solutions.

First of all let me try to convince you that various hacky solutions won't work. Say you're using Semaphores. Instead of doing Up() to signal the event, you could do Increment(1000). That will cause the Down() to just run through immediately until it's pulled back down, so the next bunch of tests will succeed. This might in fact make the race never happen in real life, but the race is still there.

1. Put a Mutex around the Wait. The mutex just ensures that only one thread at a time is actually in the wait on the event. In particular we do :


WaitOnMessage_4( GUID )
{

    {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;
    }

    for(;;)
    {
        MutexInScope( WaiterMutex );

        {
            MutexInScope(ReceiveMutex);
            pop everything on the receive FIFO
            mark everything I received as done
            if ( the GUID I wanted is done )
                return;
        }

        Wait( ReceiveEvent );
    }
}

note that the WaiterMutex is never taken by the Worker thread, only by "main" threads. Also note that it must be held around the whole check for the GUID being done or you have another race. Also we have the separate ReceiveMutex for the pop/mark phase so that when we are in the wait we don't block other threads from checking status.

Now when multiple threads try to wait, the first will get in the mutex and actually wait on the Event, the second will try to lock the mutex and stall out there and wait that way.

The problem with this is that threads which could proceed don't always get to. It's not broken - you will make progress - but it's nowhere near optimal. Consider N threads coming in and waiting on different GUIDs. Now the worker finishes something, so that wakes up a Wait, and that thread goes around the loop, which releases the Mutex. If you just so happened to wake up the one guy who could proceed, great, he returns; but if you woke up the wrong guy, he will take the mutex and go back to sleep. (Also note that when the mutex unlocks, execution might immediately switch to one of the threads waiting to lock it, because of the Windows fair scheduling heuristics.) So you only have a 1/N chance of making optimal progress, and in the worst case you might make all N threads wait until all N items are done. That's bad. So let's look at other solutions :
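Here is a runnable sketch of WaitOnMessage_4 (Python's threading primitives standing in for the Win32 ones; the FIFO/worker scaffolding is mine, and I clear the event inside the WaiterMutex before the done-check so a signal can't be lost between the check and the Wait):

```python
import threading
from collections import deque

receive_fifo = deque()           # worker pushes finished GUIDs here
done = set()                     # GUIDs we know are finished
receive_mutex = threading.Lock()
waiter_mutex = threading.Lock()  # taken only by "main" threads, never the worker
receive_event = threading.Event()

def drain_and_check(guid):
    # pop everything on the receive FIFO, mark it done, check our GUID
    with receive_mutex:
        while receive_fifo:
            done.add(receive_fifo.popleft())
        return guid in done

def wait_on_message_4(guid):
    if drain_and_check(guid):
        return
    while True:
        with waiter_mutex:       # only one thread actually waits on the event
            receive_event.clear()
            if drain_and_check(guid):
                return
            receive_event.wait() # anything pushed after the check re-sets it

def worker(guids):
    for g in guids:
        receive_fifo.append(g)   # push to the receive FIFO first...
        receive_event.set()      # ...then signal (the order matters)

waiters = [threading.Thread(target=wait_on_message_4, args=(g,)) for g in range(4)]
for t in waiters: t.start()
threading.Thread(target=worker, args=(range(4),)).start()
for t in waiters: t.join()
```

Note that a waiter can consume a signal that was really "for" somebody else's GUID; it just loops, re-checks, and goes back to sleep - which is exactly the 1/N inefficiency described above.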

2. Put the Event/Semaphore on each message instead of associated with the whole Receive queue. You wait on message->event , and the worker signals each message's event when it's done.

This also works fine. It is only race-free if each GUID is used from a single thread; if multiple threads can wait on the same GUID you are back to the same old problem. This also requires allocating a Semaphore per message, which may or may not be a big deal depending on how fine grained your work is.

On the plus side, this lets the Wait() be on the exact item you want being done, not just anything being done, so your thread doesn't wake up unnecessarily to check if its item was done.


WaitOnMessage_5A( GUID )
{
    Event ev;

    {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;

        ev = message(GUID)->event
    }

    Wait( ev );
}

we're glossing over the tricky part of this one which is the maintenance of the event on each message. Let's say we know that GUIDs are only touched by one thread at a time. Then we can do the event destruction/recycling when we receive a done message. That way we know that event is always safe for us to wait on outside of the mutex because the only person who will ever Clear or delete that event is us.

I think you can make this safe for multiple threads thusly :


WaitOnMessage_5B( GUID )
{
    Semaphore ev;

    {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            return;

        message(GUID)->users ++;
        ev = message(GUID)->semaphore;
    }

    Down( ev );

    {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        ASSERT( the GUID I wanted is done );

        message(GUID)->users --;
        if ( message(GUID)->users == 0 )
            delete message(GUID)->semaphore;
    }
}

and you make the worker signal the semaphore by doing Up(1000); you don't have to worry about that causing weird side effects, since each semaphore is only ever signalled once, and the maximum number of possible waits on each semaphore is the number of threads. Since the semaphore accesses are protected by the mutex, I think you could also do lazy semaphore creation, eg. don't make them on messages by default, just make them when somebody tries to wait on that message.
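As a concrete sketch of WaitOnMessage_5B (Python's Semaphore standing in for the Win32 one; the message table and worker loop are my own scaffolding, not the original code; Semaphore.release(n) needs Python 3.9+):

```python
import threading
from collections import deque

class Message:
    def __init__(self):
        self.done = False
        self.users = 0                       # threads currently waiting on us
        self.sem = threading.Semaphore(0)

messages = {g: Message() for g in (1, 2, 3)}
receive_fifo = deque()
receive_mutex = threading.Lock()

def drain():
    # pop everything on the receive FIFO and mark it done
    while receive_fifo:
        messages[receive_fifo.popleft()].done = True

def wait_on_message_5b(guid):
    msg = messages[guid]
    with receive_mutex:
        drain()
        if msg.done:
            return
        msg.users += 1
        sem = msg.sem
    sem.acquire()                            # Down(ev)
    with receive_mutex:
        drain()
        assert msg.done
        msg.users -= 1
        if msg.users == 0:
            msg.sem = None                   # last waiter out reclaims it

def worker():
    for g in (1, 2, 3):
        receive_fifo.append(g)
        messages[g].sem.release(1000)        # "Up(1000)" : overshoot is benign

waiters = [threading.Thread(target=wait_on_message_5b, args=(g,))
           for g in (1, 1, 2, 3)]            # two threads wait on the same GUID
for t in waiters: t.start()
threading.Thread(target=worker).start()
for t in waiters: t.join()
```

The reclaim is safe because a waiter only reaches the users-- block after its acquire succeeded, which means the worker's one release call has already happened.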

3. Make the Event per thread. You either put the ReceiveEvent in the TLS, or you just make it a local on the stack or somewhere for each thread that wants to wait. Then we just use WaitOnMessage_3 , but the ReceiveEvent is now for the thread. The tricky bit is we need to let the worker thread know what events it needs to signal.

The easiest/hackiest way is just to have the worker thread always signal N events that you determine at startup. This way will in fact work fine for many games. A slightly cleaner way is to have a list of the events that need signalling. But now you need to protect access to that with a mutex, something like :


WaitOnMessage_6( GUID , LocalEvent )
{
    {
    MutexInScope(EventListMutex)
        add LocalEvent to EventList
    }

    for(;;)
    {
        {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        Wait( LocalEvent );
    }

    {
    MutexInScope(EventListMutex)
        remove LocalEvent from EventList
    }

}

and worker does :

    do work
    push message to receive FIFO
    
    {
    MutexInScope(EventListMutex)
        signal everything in EventList
    }

one nice thing about this is that you can push GUID with LocalEvent and then only signal events that want to hear about that specific message.

Note that the code as written works, but it has very nasty thread thrashing. When the worker signals the events in the list, it immediately wakes up the waiting threads, which then try to lock the EventListMutex and immediately go back to sleep, which wakes the worker back up again. That's pretty shitty, so a slightly better version is if the worker does :


    do work
    push message to receive FIFO
    
    TempList;
    {
    MutexInScope(EventListMutex)
        copy everything in EventList to TempList
    }
    -*-
    signal everything in TempList

this fixes our thread thrashing, but it now gives us the usual lock-free type of restrictions - events in TempList are no longer protected by the Mutex, so that memory can't be reclaimed until we are sure that the worker is no longer touching them. (in practice the easiest way to do this is just to use recycling pooled allocators which don't empty their pool until application exit). Note that if an event is recycled and gets a different id, this might signal it, but that's not a correctness error because extra signals are benign. (extra signals just cause a wasteful spin around the loop to check if your GUID is done, no big deal, not enough signals means infinite wait, which is a big deal).
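Here's that copy-then-signal pattern as a runnable sketch (one Python Event per waiting thread; the scaffolding names are mine). Note the worker pushes the result before copying the list - that ordering is what makes the -*- point safe:

```python
import threading
from collections import deque

receive_fifo = deque()
done = set()
receive_mutex = threading.Lock()
event_list = []
event_list_mutex = threading.Lock()

def drain_and_check(guid):
    with receive_mutex:
        while receive_fifo:
            done.add(receive_fifo.popleft())
        return guid in done

def wait_on_message_6(guid, local_event):
    with event_list_mutex:
        event_list.append(local_event)
    while True:
        if drain_and_check(guid):
            break                    # break (not return) so we unregister below
        local_event.wait()
        local_event.clear()
    with event_list_mutex:
        event_list.remove(local_event)

def worker(guids):
    for g in guids:
        receive_fifo.append(g)       # push the result first...
        with event_list_mutex:
            temp = list(event_list)  # ...copy under the mutex...
        for ev in temp:
            ev.set()                 # ...signal outside it (the -*- point)

waiters = [threading.Thread(target=wait_on_message_6,
                            args=(g, threading.Event())) for g in range(3)]
for t in waiters: t.start()
threading.Thread(target=worker, args=(range(3),)).start()
for t in waiters: t.join()
```

If a thread registers its event after the worker's copy, its result was already in the receive FIFO, so its first drain_and_check finds it and it never waits - the same argument as in the text.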

NOTE : you might think there's a race at -*- ; the apparent problem is that the worker has got the event list in TempList, and it gets swapped out, then on some main thread I add myself to EventList, run through and go to sleep in the Wait. Then worker wakes up at -*- and signals TempList - and I'm not in the list to signal! Oh crap! But this can't happen because if that was the work item I needed, it was already put in the receive FIFO and I should have seen it and returned before going into the Wait.

We can also of course get rid of the Mutex on EventList completely by doing the usual lockfree gymnastics; instead make it a message passing thing :


WaitOnMessage_6B( GUID , LocalEvent )
{
    Send Message {Add,LocalEvent,GUID}

    for(;;)
    {
        {
        MutexInScope(ReceiveMutex);
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        Wait( LocalEvent );
    }

    Send Message {Remove,LocalEvent,GUID}
}

and worker does :

    pop message from "send FIFO" to get work
    do work
    push message to "receive FIFO"

    pop all event messages
        process Add/Remove commands 

    signal everything in EventList

but this isn't tested and I have very little confidence in it. I would want to run it through Relacy before using it in real code.

4. Just don't do it. Sometimes domain-specific solutions are best. In particular there's a very simple solution I can use for the current problem I'm having.

The thing that made me hit this issue is that I made my work-stealing Worklet system able to wait on IO's. In this case the "worker" is the IO service thread and the messages are IO requests. So now the main thread can wait on IO and so can the Worklets. It's important that they really go to a proper sleep if they are blocked on IO.

But I can solve it in a simple way in that case. The Worklets already have a way to go sleep when they have no work to do. So all I do is if the Worklets only have work which depends on an uncompleted IO, they push their Work back onto the main work dispatch list and go to sleep on the "I have no work" event. Then, whenever the main thread receives any IO wakeup event, it immediately goes and checks the Worklet dispatch list and sees if anything was waiting on an IO completion. If it was, then it hands out the work and wakes up some Worklets to do it.

This solution is more direct and also has the nice property of not waking up unnecessary threads and so on.

09-21-10 - Waiting on Thread Events

A small note on a common threading pattern.

You have some worker thread which takes messages and does something or other and then puts out results. You implement the message passing in some way or other, either with a mutex or a lock-free FIFO or whatever, it doesn't matter for our purposes here.

So your main thread makes little work messages, sends them to the worker, he does them, sends them back. The tricky bit comes when the main thread wants to wait on a certain message being done. Let's draw the structure :


   main      ---- send --->     worker
  thread    <--- receive ---    thread

First assume our messages have some kind of GUID. If the pointers were never recycled the GUID could just be the pointer, but never recycling is generally a bad idea; a pointer plus a counter would be fine. So the main thread says Wait on this GUID being done. The first simple implementation would be a spin loop :
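A pointer-plus-counter GUID can be sketched like this (an illustration of the idea, not the author's actual scheme; all the names are mine):

```python
# Recycle-safe GUID: pair the message slot (standing in for the pointer)
# with a generation counter that is bumped every time the slot is reused.
from dataclasses import dataclass

@dataclass(frozen=True)
class Guid:
    slot: int        # which message object (the "pointer")
    generation: int  # how many times that slot has been recycled

class MessagePool:
    def __init__(self, n):
        self.generation = [0] * n
        self.free = list(range(n))
    def alloc(self):
        slot = self.free.pop()
        return Guid(slot, self.generation[slot])
    def release(self, guid):
        self.generation[guid.slot] += 1   # old GUIDs for this slot go stale
        self.free.append(guid.slot)
    def is_current(self, guid):
        return self.generation[guid.slot] == guid.generation

pool = MessagePool(4)
g = pool.alloc()
pool.release(g)          # any copy of g still floating around is now stale
assert not pool.is_current(g)
```

A stale GUID compares unequal to the GUID of whatever new message is living in the recycled slot, which is the whole point.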


WaitOnMessage_1( GUID )
{

    for(;;)
    {
        pop everything on the receive FIFO
        mark everything I received as done
    
        is the GUID I wanted in the done set?
           if so -> return

        yield processor for a while
    }

}

and that works just fine. But now you want to be a little nicer and not just spin, you want to make your thread actually go to sleep.

Well you can do it in Win32 using Events. (on most other OS'es you would use Semaphores, but it's identical here). What you do is you have the worker thread set an event when it pushes a message, and now we can wait on it. We'll call this ReceiveEvent, and our new WaitOnMessage is :


WaitOnMessage_2( GUID )
{

    for(;;)
    {
        Clear( ReceiveEvent );

        pop everything on the receive FIFO
        mark everything I received as done
    
        is the GUID I wanted in the done set?
           if so -> return

        Wait( ReceiveEvent );
    }

}

that is, we clear the receive event so it's unsignalled, we see if our GUID is done, if not we wait until the worker has done something, then we check again. We aren't sleeping on our specific work item, but we are sleeping on the worker doing something.

In the Worker thread it does :


    pop from "send FIFO"

    do work

    push to "receive FIFO"
    
    Signal( ReceiveEvent );

note the order of the push and the Signal (Signal after push) is important or you could have a deadlock due to a race because of the Clear. While the code as-is works fine, the Clear() is considered dangerous and is actually unnecessary - if you remove it, the worst that will happen is you will run through the Wait() one time that you don't need to, which is not a big deal. Also Clear can be a little messy to implement for Semaphores in a cross-platform way. In Semaphore speak, Wait() is Down() , Signal() is Up(), and Clear() is trylock().
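The pattern above translates directly (Python's threading.Event standing in for the Win32 Event; the FIFO plumbing is mine, and this version is only safe with a single waiting thread, which is the point of what follows):

```python
import threading
from collections import deque

receive_fifo = deque()   # worker pushes finished GUIDs here
done = set()             # GUIDs we know are finished
receive_event = threading.Event()

def wait_on_message_2(guid):
    while True:
        receive_event.clear()
        while receive_fifo:          # pop everything on the receive FIFO
            done.add(receive_fifo.popleft())
        if guid in done:
            return
        receive_event.wait()         # sleep until the worker does something

def worker(guids):
    for g in guids:
        receive_fifo.append(g)       # push to "receive FIFO" *before*...
        receive_event.set()          # ...Signal, or the Clear can strand us

w = threading.Thread(target=worker, args=([1, 2, 3],))
m = threading.Thread(target=wait_on_message_2, args=(3,))
w.start(); m.start()
w.join(); m.join()
```

If the worker signalled before pushing, the waiter could Clear that signal, miss the item in its drain, and then Wait on an event nobody will set again - which is why the push/Signal order matters.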

Okay, so this is all fine and we're happy with ourselves, until one day we try to WaitOnMessage() from two different threads. I've been talking as if we have one main thread and one worker thread, but the basic FIFO queues work fine for N "main" threads, and we may well want to have multiple threads generating and waiting on work. So what's the problem ?

First of all, there's a race in the status check, so we put a Mutex around it :


WaitOnMessage_3A( GUID )
{

    for(;;)
    {
        Mutex 
        {
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        Wait( ReceiveEvent );
    }

}

you need that because one thread could come in and pop the queue but not process it, and then another thread comes in and sees its GUID is not done and goes to sleep incorrectly. We need the pop and the flagging done to act like they are atomic.

Now you might be inclined to make this safe by just putting the Mutex around the whole function. In fact that works, but it means you are holding the mutex while you go into your Wait sleep. Then if another thread comes in to check status, it will be forced to sleep too - even if its GUID is already done. So that's very bad, we don't want to block status checking while we are asleep.

Why is the _3A version still broken ? Consider this case : we have two threads, thread 1 and 2, they make their own work and send it off then each call WaitOnMessage which is :


WaitOnMessage_3( GUID )
{

    for(;;)
    {
        Mutex 
        {
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        [A]

        Wait( ReceiveEvent );

        [B]
    }

}

Thread 1 runs through to [A] and then swaps out. Thread 2 runs through to [B], which waits on the event; the worker thread then does all the work and sets the event; thread 2 gets the signal and the event gets cleared (by an explicit Clear, or because it's an auto-reset event). Now thread 1 runs again at [A] and immediately goes into an infinite Wait.

D'oh !

I think this is long enough just describing the problem, so I'll get at some solutions in the next post.

old rants