12/08/2008

12-08-08 - DXTC Summary

I thought I should fix some things that were wrong or badly said in my original DXTC posts :

DXTC Part 1
DXTC Part 2
DXTC Part 3
DXTC Part 4

Added later :

DXTC Addendum
DXTC More Followup
DXTC is not enough
DXTC is not enough part 2

On the "ryg" coder : there was a bug/typo in the implementation I was using which gave bogus results, so you should just ignore the numbers in those tables. See for correction : Molly Rocket Forum on Ryg DXTC coder . Also I should note he does have some neat code in there. The color finder is indeed very fast; it is an approximation (not 100% right) but the quality is very close to perfect. Also his "RefineBlock" which does the Simon Brown style optimize-end-from-indices is a very clever fast implementation that collapses a lot of work. I like the way he uses the portions of one 32 bit word to accumulate three counters at a time.

I also mentioned in those posts that the optimized version of the Id "real time DXTC" bit math was "wrong". Well, yes, it is wrong in the sense that it doesn't give you exactly the same indeces, but apparently that was an intentional approximation by JMP, and in fact it's a very good one. While it does pick different indeces than the exact method, it only does so in cases where the error is zero or close to zero. On most real images the actual measured error of this approximation is exactly zero, and it is faster.

So, here are some numbers on a hard test set for different index finders :


    exact : err:  31.466375 , clocks: 1422.256522

    id    : err:  31.466377 , clocks: 1290.232239
            diff:  0.000002

    ryg   : err:  31.466939 , clocks:  723.051241
            diff:  0.000564

    ryg-t : err:  31.466939 , clocks:  645.445860
            diff:  0.000564

You can see the errors for all of them are very small. "ryg-t" is a new one I made which uses a table to turn the dot product checks into indexes, so that I can eliminate the branches. Start with the "ryg" dot product code and change the inner loop to :

    const int remap[8] = { 0 << 30,2 << 30,0 << 30,2 << 30,3 << 30,3 << 30,1 << 30,1 << 30 };

    for(int i=0;i < 16;i++)
    {
        int dot = block.colors[i].r*dirr + block.colors[i].g*dirg + block.colors[i].b*dirb;

        int bits =( (dot < halfPoint) ? 4 : 0 )
                | ( (dot < c0Point) ? 2 : 0 )
                | ( (dot < c3Point) ? 1 : 0 );

        mask >>= 2;
        mask |= remap[bits];
    }

I should note that these speed numbers are for pure C obvious implementations and if you really cared about speed you should use SSE and who knows what would win there.


Now this last bit is a little half baked but I thought I'd toss it up. It's a little bit magical to me that Ryg's Mul8Bit (which is actually Jim Blinn's) also happens to produce perfect quantization to 565 *including the MSB shifting into LSB reproduction*.

I mentioned before that the MSB shifting into LSB thing is actually "wrong" in that it would hurt RMSE on purely random data, because it is making poor use of the quantization bins. That is, for random data, to quantize [0,255] -> 32 values (5 bits) you should have quantization bins that each hold 8 values, and whose reconstruction is at the middle of each bin. That is, you should reconstruct to {4,12,20,...} Instead we reconstruct to {0,8,...247,255} - the two buckets at the edges only get 4 values, and there are some other ones that get 9 values. Now in practice this is a good thing because your original data is *not* random - it's far more likely to have exact 0 and 255 values in the input, so you want to reproduce those exactly. So anyway, it's not a uniform quantizer on the range [0,255]. In fact, it's closer to a uniform quantizer on the range [-4,259].


I think it might actually just be a numerical coincidence in the range [0,255].

The correct straight-forward quantizer for the DXTC style colors is

    return (32*(val+4))/(256+8);

for R.  Each quantization bin gets 8 values except the top and bottom which only get 4.  That's equivalent to quantizing the range [-4,256+4) with a uniform 8-bin quantizer.

Now

1/(256 + 8) = 1/256 * 1/(1 + 8/256)

We can do the Taylor series expansion of 1/(1+x) for small x on the second term and we get ( 1 - 8/256 + 64/256/256) up to powers of x^2

So we have

    ( (32*val+128) * ( 1 - 8/256 + 64/256/256) ) >> 8

And if we do a bunch of reduction we get 

    return ( 31*val+124 + ((8*val+32)>>8) ) >> 8

If we compare this to Mul8bit :

        return ( 31*val+128 + ((val*31 + 128)>>8)) >> 8;

it's not exactly the same math, but they are the same for all val in [0,255]

But I dunno. BTW another way to think about all this is to imagine your pixels are an 8.inf fixed point rep of an original floating point pixel, and you should replicate the 8 bit value continuously. So 0 = 0, 255 = FF.FFFFFF.... = 1.0 exactly , 127 = 7F.7F7F7F..

BTW this reminds me : When I do Bmp -> FloatImage conversions I used to do (int + 0.5)/256 , that is 0 -> 0.5/256 , 255 -> 255.5/256 . I no longer do that, I do 0->0, and 255 -> 1.0 , with a 1/255 quantizer.

12/05/2008

12-05-08 - lrotl

Well I found one x86 ASM widget. I've always known you could do nice fast barrel rotations on x86 but thought they were inaccessible from C. Huzzah! Stdlib has a function "_lrotl()" which is exactly what you want, and happily it is one of the magic functions the MSVC recognizes in their compiler and turns into assembly with all goodness. (They also have custom handling for strcmp, memcpy, etc.)

Oh, I noticed lrotl in OpenSSL which seems to have a ton of good code for different hashes/checksums/digests/whatever-the-fuck-you-call-them's.

As a test I tried it on Sean's hash, which is quite good and fast for C strings :


RADINLINE U32 stb_hash(const char *str)
{
   U32 hash = 0;
   while (*str)
   {
      hash = (hash << 7) + (hash >> 25) + *str++;
   }
   return hash;
}

RADINLINE U32 stb_hash_rot(const char *str)
{
   U32 hash = 0;
   while (*str)
   {
      hash = _lrotl(hash,7) + *str++;
   }
   return hash;
}

stb_hash : 6.43 clocks per char
stb_hash_rot : 3.24 clocks per char

Woot! Then I also remembered something neat I saw today at Paul Hsieh's Assembly Lab . A quick check for whether a 32 bit word has any null byte in it :

#define has_nullbyte(x) (((x) - 0x01010101) & ( ~(x) & 0x80808080 ) )

Which can of course be used to make an unrolled stb_hash :

RADINLINE U32 stb_hash_rot_dw(const char *str)
{
   U32 hash = 0;
   
   while ( ! has_nullbyte( *((U32 *)str) ) )
   {
      hash = _lrotl(hash,7) + str[0];
      hash = _lrotl(hash,7) + str[1];
      hash = _lrotl(hash,7) + str[2];
      hash = _lrotl(hash,7) + str[3];
      str += 4;
   }
   while (*str)
   {
      hash = _lrotl(hash,7) + *str++;
   }
   return hash;
}

stb_hash_rot_dw : 2.50 clocks

So anyway, I'm getting distracted by pointless nonsense, but it's nice to know lrotl works. (and yes, yes, you could be faster still by switching the hash algorithm to something that works directly on dwords)

12-05-08 - 64 Bit Multiply

Something that I've found myself wanting to do a lot recently is multiply two 32 bit numbers, and then take the top 32 bit dword from the 64 bit result. In C this looks like :

U32 mul32by32returnHi(U32 a,U32 b)
{
    U64 product = (U64) a * b;
    U32 ret = (U32)(product >> 32);
    return ret;
}

That works fine and all, but the C compiler doesn't understand that you're doing something very simple. It generates absolutely disgusting code. In particular, it actually promotes a & b to 64 bit, and then calls a function called "_allmul" in the Windows NT RTL. This allmul function does a bunch of carries and such to make 64-by-64 bit multiply work via the 32+32->64 multiply instruction in x86. You wind up with a function that takes 60 clocks when it could take 6 clocks :

U32 mul32by32returnHi(U32 a,U32 b)
{
    __asm
    {
        mov eax, a
        mul b
        mov eax,edx
    }
}

Now, that's nice and all, the problem is that tiny inline asm functions just don't work in MSVC these days. Calling a big chunk of ASM is fine, but using a tiny bit of ASM causes the optimizer to disable all its really fancy optimizations, and you wind up with a function that's slower overall.

What I'd like is a way to get at some of the x86 capabilities that you can't easily get from C. The other thing I often find myself wanting is a way to get at the carry flag, or something like "sbb" or "adc". The main uses of those are Terje's cool tricks to get the sign of an int or add with saturation

It would be awesome if you could define your own little mini assembly functions that acted just like the built-in "intrinsics". It would be totally fine if you had to add a bunch of extra data to tell the compiler about dependencies or side effects or whatever, if you could just make little assembly widgets that worked as neatly as their own intrinsics it would be totally awesome for little ops.

ADDENDUM : urg ; so I tested Won's suggestion of Int32x32To64 and found it was doing the right thing (actually UInt32x32To64 - and BTW that's just a macro that's identical to my first code snippet - it doesn't seem to be a magic intrinsic, though it does have platform switches so you should probably use that macro). That confused me and didn't match what I was seeing, so I did some more tests...

It turns out the first code snippet of mul32by32returnHi actually *will* compile to the right thing - but only if you call it from simple functions. If you call it from a complex function the compiler gets confused, but if you call it from a little tight test loop function it does the right thing. URG.

Here's what it does if I try to use it in my StringHash interpolation search test code :


            int start = mul32by32returnHi( query.hash, indexrange );
004087A5  xor         eax,eax 
004087A7  push        eax  
004087A8  mov         eax,dword ptr [esp+38h] 
004087AC  push        eax  
004087AD  push        0    
004087AF  push        ebx  
004087B0  mov         dword ptr [esi],ebx 
004087B2  call        _allmul (40EC00h) 
004087B7  mov         ecx,dword ptr [esp+14h] 

You can see it's extending the dwords to 64 bits, pushing them all, calling the function, then grabbing just one dword from the results. Ugh.

And here's what it does in a tight little test loop :


      U32 dw = *dwp++;
      hash = mul32by32returnHi(hash,dw) ^ dw;
00401120  mov         eax,ecx 
00401122  mul         eax,edx 
00401124  add         esi,4 
00401127  xor         edx,ecx 
00401129  mov         ecx,dword ptr [esi] 

Which not only is correct use of 'mul' but also has got crazy reordering of the loop which makes it a lot faster than my inline ASM version.

12/02/2008

12-02-08 - Oodle Happy

This is from a few weeks ago, but I just pulled it up again and realized it pleases me :

Free Image Hosting at www.ImageShack.us

This is an image of the little thread profile viewer that I made which shows timelines of multiple threads. This particular portion is showing an LZ decompressor thread in the top bar and the IO thread in the bottom bar. The LZ decompressor is double buffered, you can see it ping-ponging between working on the two buffers. As it finished each buffer it fires an IO to get more compressed data. You can see those IO's get fired and run in the bottom. This portion is CPU bound, you can see the IO thread is stalling out a lot (that's partly because there's no seeking and the packed data is coming through the windows file cache).

Anyway, the thing that pleases me is that it's working perfectly. The CPU is 100% busy with decompression work and never wastes any time waiting for IO. The IO thread wakes and sleeps to fire the IOs. Yum. In the future complication and mess I will never again see such nice perfect threading behavior.

12-02-08 - H264

I hate H264. It's very good. If you were making a video codec from scratch right now you would be hard pressed to beat H264. And that's part of why I hate it. Because you have to compete with it, and it's a ridiculously over-complex bucket of bolts. There are so many unnecessary modes and different kinds of blocks, different entropy coders, different kinds of motion compensation, even making a fully compliant *decoder* is a huge pain in the ass.

And the encoder is where the real pain lies. H264 like many of the standards that I hate, is not a one-to-one transform from decompressed streams to code streams. There is no explicit algorithm to find the optimal stream for a given bit rate. With all the different choices that the encoder has of different block types, different bit allocation, different motion vectors, etc. there's a massive amount of search space, and getting good compression quality hinges entirely on having a good encoder that searches that space well.

All of this stifles innovation, and also means that there are very few decent implementations available because it's so damn hard to make a good implementation. It's such a big arcane standard that's tweaked to the Nth degree, there are literally thousands of papers about it (and the Chinese seem to have really latched on to working on H264 improvements which mean there are thousands of papers written by non-English speakers, yay).

I really don't like overcomplex standards, especially this style that specifies the decoder but not the encoder. Hey, it's a nice idea in theory, it sounds good - you specify the decoder, and then over the years people can innovate and come up with better encoders that are still compatible with the same decoder. Sounds nice, but it doesn't work. What happens in the real world is that a shitty encoder gains acceptance in the mass market and that's what everyone uses. Or NO encoder ever takes hold, such as with the so-called "MPEG 4" layered audio spec, for which there exists zero mainstream encoders because it's just too damn complex.

Even aside from all that annoyance, it also just bugs me because it's not optimal. There are lots of ways to encode the exact same decoded video, and that means you're wasting code space. Any time the encoder has choices that let it produce the same output with different code streams, it means you're wasting code space. I talked about this a bit in the past in the LZ optimal parser article, but it should be intuitively obvious - you could take some of those redundant code streams and make them decode to something different, which would give you more output possibilities and thus reduce error at the same bit rate. Obviously H264 still performs well so it's not a very significant waste of code space, but you could make the coder simpler and more efficient by eliminating those choices.

Furthermore, while the motion compensation and all that is very fancy, it's still "ghetto". It's still a gross approximation of what we really want to do, which is *predict* the new pixel from its neighbors and from the past sequence of frames. That is, don't just create motion vectors and subtract the value and encode the difference - doing subtraction is a very primitive form of prediction.

Making a single predicted value and subtracting is okay *if* the predicted probability spectrum is unimodal laplacian, and you also use what information you can to predict the width of the laplacian. But it often isn't. Often there are pixels that you can make a good prediction that this pixel is very likely either A or B, and each is laplacian, but making a single prediction you'd have to guess (A+B)/2 which is no good. (an example of this is along the edges of moving objects, where you can very strongly predict any given pixel to be either from the still background or from the edge of the moving object - a bimodal distribution).

12-01-08 - VM with Virtual Disk

I had this idea for a VM to completely sandbox individual programs. It goes like this :

Do a fresh install of Windows or whatever OS. Take a full snapshot of the disk and store it in a big file. This is now *const* and will be shared by all sandboxes.

Every program that you want to run in isolation gets its own sandbox. Initially a sandbox just points at the const OS snapshot which is shared. File reads fall through to that. When you run the installer on the sandbox, it will do a bunch of file writes - those go in a journal which is unique to this sandbox that stores all the file renames, writes, deletes, etc. That can be saved or simply thrown away after the program is done.

You can optionally browse to sandbox journals. They look just like a regular disk with files. What you're seeing is the const OS snapshot with the changes that the individual program made on top of it. You can then copy files in and out of the sandbox drive to get them to your real disk.

So, for example, when you download some program from the internet that you don't trust, you can just pop up a new sandbox and run it there. This is *instant* and the program is 100% isolated from being able to do file IO to your real system. But if it makes some files you want, you can easily grab them out.

You could also mount "portals" across the sandboxes if you want to. For example, say you don't trust shitty iTunes and you want to run it in a sandbox so it can't mess with your registry or anything. But you want your music files to be on your main drive and have those be accessible to iTunes. You can mount a portal like a net drive for the sandbox to be able to "see out" to just that directory. That way you don't have to like duplicate your music files into the iTunes sandbox or whatever.

Aside from isolating rogue programs, this fixes a lot of problems with Windows. It lets you do 100% clean uninstalls - boom you just delete the whole sandbox and the program has no fingers left over. Every program gets its own registry and set of DLLs and such so there can never be conflicts. You don't have that damn problem of Windows always mysteriously going to shit after 5 years.

If you put your OS on c: and all your data on d:, you could easily just let all the sandboxes of trusted programs have portals to d: so that you can just run Photoshop in a sandbox and browse to d: and work on images, and it feels like just run normal programs on a normal computer.

11/28/2008

11-28-08 - Optimization

Yes yes, I'm generally in the "measure carefully and only optimize where needed" camp which the pretentious and condescending love. Weblog pedants love to point out pointless optimizations and say "did you really measure it?". Fie.

There's a great advantage to just blindly optimizing (as long as it doesn't severely cost you in terms of clarity or maintainability or simplicity). It lets you stop worrying. It frees up head space. Say you do some really inefficient string class and then you use it all over your code. Every time you use you have to wonder "is this okay? am I getting killed by this inefficient class?" and then you measure it, and no, it's fine. Then a month later you worry again. Then a month later you worry again. It gives you great peace of mind to know that all your base components are great. It lets you worry about the real issue, which is the higher level algorithm. (note that micro-optimizing the higher level algorithm prematurely is just always bad).

11-28-08 - Chance of CRC being bad

Say I run the CRC32 on data chunks. I've got them identified by name so I only care about false equality (collisions) on objects of the same name. I'm going to be conservative and assume I only get 31 good bits out of the CRC so there are 2^31 values, not a full 2^32. I think around 20 versions of the data per name max. So the chance of collision is a birthday "paradox" thing (it's not a freaking paradox, it's just probability!) :

P = "1 - e^( - 20*19/(2*(2^31)) )" = 8.847564e-008

Now if you have N names, what's the chance that *any* of them has a collision?


C = 1 - (1 - P)^N

So, for what N of objects is C = 50% ?


log(0.5) = N * log(1 - P)

N = log(0.5)/(log(1-P)

N = 7,834,327

You can have 7 Million files before you get a 50% chance of any one of them having a collision.

For a more reasonable number like N = 100k , what's the chance of collision?


C = "1 - (1 - 8.847564*(10^-8))^100000" = 0.00881 = 0.881 %

These probabilities are very low, and I have been pessimistic so they're even lower, but perhaps they are too high. On any given project, an 0.881% chance of a problem is probably okay, but for me I have to care about what's the chance that *any customer* has a problem, which puts me closer to the 7 M number of files and means it is likely I would see one problem. Of course a collision is not a disaster. You just tell Oodle to refresh everything manually, and the chance is that only happens once to anybody ever in the next few years.

BTW we used CRC32 to hash all the strings in Oddworld : Stranger's Wrath. We had around 100k unique strings which is pretty close to the breaking point for CRC32 (unlike the above case, they all had to be unique against each other). Everything was fine until near the end of the project when we got a few name collisions. Oh noes! Disaster! Not really. I just changed one of the names from "blue_rock.tga" to "blue_rock_1.tga" and it no longer collided. Yep, that was a crazy tough problem. (of course I can't use solutions like that as a library designer).

11/26/2008

11-26-08 - Oodle

OMG I finally got hot loading sort of working with bundles and override bundles. So you can build a package of a bunch of files. The game runs and grabs the package that pages in. You touch some file, the game sees the change, that causes it to make a single-file-bundle for the changed file, it loads in the new bundle, and then patches the resource table so that the new version is used instead of the old one in the package. Then the Game Dictionary is updated so that future runs will also load the new version.

Phew. It works but it's fragile and I'm not happy with how careful you have to be. There are lots of ways it can fail. For example, the packages all load one by one asynchronously. If you fire the big package load first, it may come in and get hooked up before the override package. Then if you hook up your game resources, you have hooked up to the old (wrong) data. You can prevent this by doing a call on each resource that you get to say "is this the right version of this resource" before hooking up to it. But currently there's just a ton of stuff like that you have to be aware of and make sure to do the right call, etc. etc.

I keep coming back to the problem that I need to know whether I have the right version of the file in a package. There are three options I can see :

1. Fuck it. Do nothing. This requires the game to load the right versions all the time. Clients could, for example, always just work unpackaged while editing, then when they make paging packages they would simply not be editable or incremental-linkable in a nice way. This is not a happy solution but it "works" in the sense that it is *reliable* about not working.

2. Use mod times. This works fine and is easy and fast *as long as you only touch and make files on your local machine*. Which would fail for major game devs since they tend to compile files on some kind of build server, or check in and share compiled files. But if you think about the way programmers work - we don't ever check in our .OBJ files, we just check in sources and everybody has to rebuild on their local machine. If you make your artists do the same thing - only share source files and always local rebuild - OR sync to a full compiled set from the server and only run compiled, but never try to mix local changes with server compiles - then it works fine.

3. Use CRC's. This is the most robust and reliable. The packages store the CRC of the files they were made from. If the source file on disk has a different CRC, we assume that the source file is better. (we don't try to use mod times to tell who's newer because of the server-client mod time problem). If the CRC is different we repackage using the local source file. Then when we load we always prefer packages that have content that matches the local CRC. This works and is stable and good and all. The only problem is all the time you spend doing CRC's on files, which may or may not be a big deal. Obviously running over your whole tree and doing CRC's all the time would be ridiculous, but you can use mod time to tell when to redo the CRC, so you only incrementally redo it. Even getting all the mod times is too slow, so you can use a persistent Watcher app to cache the drive dir listing and file infos.

The "CRC" method is not necessarilly "right" in the sense that it doesn't load the newest version of the content, but it is reliable, it always does the same thing - it loads content that corresponds to the source version on the user's disk.

BTW this last thing is something the OS should really do but totally fails on. With the amount of memory we have now, the OS should just keep a listing of every dir you've ever visited in memory at all times. Even if you visit every dir on the disk it would only be 100 Megs. You could cap it at 64 MB or something and LRU, and obviously having 64 MB of dir listings is going to make your dir access instant all the time. I might just write an app to do this myself.

I'm kind of tempted to offer #1 and #3 to clients and not offer #2. I don't want to offer any features that sort of work or will have weird behavior that looks like "bugs".

Also BTW yes of course there will also just be a mode to load what's in the packages and not even look at what's newest or anything. This is what the real shipping game will use of course, once you have your final packages set up it just loads them and assumes you made them right.

11/25/2008

11-24-08 - Lua Coroutines

.. are cool. Coroutines is the awesome way to do game logic that takes time. The script looks like :

PutRightFootIn();
PutRightFootOut();
if ( GetState(Friends) == dancing )
{
    DoHokeyPokey();
    TurnAround();
}
else
{
    LookAt( Friends );
    speech = StartSpeech("Screw you guys");
    FlipOff();
    WaitFor(speech);
    Go( Gome );
}

These things take various amounts of time and actually yield back execution to the game while that happens. Each NPC or game object has its own coroutine which the game basically just "pings". The Lua code really only ever runs a few lines, all it does is pick a new task and make the object do that task, and then it yields again. (obviously if it was just a static list like the example above you could do that easily other ways, the good thing about script is it can do conditionals and make decisions as it goes).

Coco lets you do a real Lua yield & resume from C calls, which lets you build your "threading" into your game's helper functions rather than relying on your designers to write scripts that yield correctly. LuaJIT (at the Coco link) is cool too.

I don't really like the Lua syntax, and I also personally hate untyped languages. (I think the right thing for scripting style languages is "weakly typed", that is all numeric types convert silenty and thing do their ToString and such, but I like to just get a compile error when I try to pass an alligator to the factorial function. Also having explicit types of "location" and "NPC" and such would be nice.) But having an off the shelf scripting language that does the coroutine thing is a pretty big win.

I generally think of simple game AI as being in 2 (or more) parts. The 2 main parts are the "brain" (or reactions) and the "job" (or task). The job/task is what the guy is currently doing and involves a plan over time and a sequence of actions. This is best done with a coroutine of some kind. The brain/reaction part is just a tick every frame that looks around and maybe changes the current task. eg. it's stuff like "saw the player, go into attack job". That can be well implemented as a regular Lua function call that the game just calls every frame, it should just do stuff that doesn't take any time and return immediately.

old rants