12/02/2008

12-02-08 - Oodle Happy

This is from a few weeks ago, but I just pulled it up again and realized it pleases me :

Free Image Hosting at www.ImageShack.us

This is an image of the little thread profile viewer that I made which shows timelines of multiple threads. This particular portion is showing an LZ decompressor thread in the top bar and the IO thread in the bottom bar. The LZ decompressor is double buffered, you can see it ping-ponging between working on the two buffers. As it finished each buffer it fires an IO to get more compressed data. You can see those IO's get fired and run in the bottom. This portion is CPU bound, you can see the IO thread is stalling out a lot (that's partly because there's no seeking and the packed data is coming through the windows file cache).

Anyway, the thing that pleases me is that it's working perfectly. The CPU is 100% busy with decompression work and never wastes any time waiting for IO. The IO thread wakes and sleeps to fire the IOs. Yum. In the future complication and mess I will never again see such nice perfect threading behavior.

12-02-08 - H264

I hate H264. It's very good. If you were making a video codec from scratch right now you would be hard pressed to beat H264. And that's part of why I hate it. Because you have to compete with it, and it's a ridiculously over-complex bucket of bolts. There are so many unnecessary modes and different kinds of blocks, different entropy coders, different kinds of motion compensation, even making a fully compliant *decoder* is a huge pain in the ass.

And the encoder is where the real pain lies. H264 like many of the standards that I hate, is not a one-to-one transform from decompressed streams to code streams. There is no explicit algorithm to find the optimal stream for a given bit rate. With all the different choices that the encoder has of different block types, different bit allocation, different motion vectors, etc. there's a massive amount of search space, and getting good compression quality hinges entirely on having a good encoder that searches that space well.

All of this stifles innovation, and also means that there are very few decent implementations available because it's so damn hard to make a good implementation. It's such a big arcane standard that's tweaked to the Nth degree, there are literally thousands of papers about it (and the Chinese seem to have really latched on to working on H264 improvements which mean there are thousands of papers written by non-English speakers, yay).

I really don't like overcomplex standards, especially this style that specifies the decoder but not the encoder. Hey, it's a nice idea in theory, it sounds good - you specify the decoder, and then over the years people can innovate and come up with better encoders that are still compatible with the same decoder. Sounds nice, but it doesn't work. What happens in the real world is that a shitty encoder gains acceptance in the mass market and that's what everyone uses. Or NO encoder ever takes hold, such as with the so-called "MPEG 4" layered audio spec, for which there exists zero mainstream encoders because it's just too damn complex.

Even aside from all that annoyance, it also just bugs me because it's not optimal. There are lots of ways to encode the exact same decoded video, and that means you're wasting code space. Any time the encoder has choices that let it produce the same output with different code streams, it means you're wasting code space. I talked about this a bit in the past in the LZ optimal parser article, but it should be intuitively obvious - you could take some of those redundant code streams and make them decode to something different, which would give you more output possibilities and thus reduce error at the same bit rate. Obviously H264 still performs well so it's not a very significant waste of code space, but you could make the coder simpler and more efficient by eliminating those choices.

Furthermore, while the motion compensation and all that is very fancy, it's still "ghetto". It's still a gross approximation of what we really want to do, which is *predict* the new pixel from its neighbors and from the past sequence of frames. That is, don't just create motion vectors and subtract the value and encode the difference - doing subtraction is a very primitive form of prediction.

Making a single predicted value and subtracting is okay *if* the predicted probability spectrum is unimodal laplacian, and you also use what information you can to predict the width of the laplacian. But it often isn't. Often there are pixels that you can make a good prediction that this pixel is very likely either A or B, and each is laplacian, but making a single prediction you'd have to guess (A+B)/2 which is no good. (an example of this is along the edges of moving objects, where you can very strongly predict any given pixel to be either from the still background or from the edge of the moving object - a bimodal distribution).

12-01-08 - VM with Virtual Disk

I had this idea for a VM to completely sandbox individual programs. It goes like this :

Do a fresh install of Windows or whatever OS. Take a full snapshot of the disk and store it in a big file. This is now *const* and will be shared by all sandboxes.

Every program that you want to run in isolation gets its own sandbox. Initially a sandbox just points at the const OS snapshot which is shared. File reads fall through to that. When you run the installer on the sandbox, it will do a bunch of file writes - those go in a journal which is unique to this sandbox that stores all the file renames, writes, deletes, etc. That can be saved or simply thrown away after the program is done.

You can optionally browse to sandbox journals. They look just like a regular disk with files. What you're seeing is the const OS snapshot with the changes that the individual program made on top of it. You can then copy files in and out of the sandbox drive to get them to your real disk.

So, for example, when you download some program from the internet that you don't trust, you can just pop up a new sandbox and run it there. This is *instant* and the program is 100% isolated from being able to do file IO to your real system. But if it makes some files you want, you can easily grab them out.

You could also mount "portals" across the sandboxes if you want to. For example, say you don't trust shitty iTunes and you want to run it in a sandbox so it can't mess with your registry or anything. But you want your music files to be on your main drive and have those be accessible to iTunes. You can mount a portal like a net drive for the sandbox to be able to "see out" to just that directory. That way you don't have to like duplicate your music files into the iTunes sandbox or whatever.

Aside from isolating rogue programs, this fixes a lot of problems with Windows. It lets you do 100% clean uninstalls - boom you just delete the whole sandbox and the program has no fingers left over. Every program gets its own registry and set of DLLs and such so there can never be conflicts. You don't have that damn problem of Windows always mysteriously going to shit after 5 years.

If you put your OS on c: and all your data on d:, you could easily just let all the sandboxes of trusted programs have portals to d: so that you can just run Photoshop in a sandbox and browse to d: and work on images, and it feels like just run normal programs on a normal computer.

11/28/2008

11-28-08 - Optimization

Yes yes, I'm generally in the "measure carefully and only optimize where needed" camp which the pretentious and condescending love. Weblog pedants love to point out pointless optimizations and say "did you really measure it?". Fie.

There's a great advantage to just blindly optimizing (as long as it doesn't severely cost you in terms of clarity or maintainability or simplicity). It lets you stop worrying. It frees up head space. Say you do some really inefficient string class and then you use it all over your code. Every time you use you have to wonder "is this okay? am I getting killed by this inefficient class?" and then you measure it, and no, it's fine. Then a month later you worry again. Then a month later you worry again. It gives you great peace of mind to know that all your base components are great. It lets you worry about the real issue, which is the higher level algorithm. (note that micro-optimizing the higher level algorithm prematurely is just always bad).

11-28-08 - Chance of CRC being bad

Say I run the CRC32 on data chunks. I've got them identified by name so I only care about false equality (collisions) on objects of the same name. I'm going to be conservative and assume I only get 31 good bits out of the CRC so there are 2^31 values, not a full 2^32. I think around 20 versions of the data per name max. So the chance of collision is a birthday "paradox" thing (it's not a freaking paradox, it's just probability!) :

P = "1 - e^( - 20*19/(2*(2^31)) )" = 8.847564e-008

Now if you have N names, what's the chance that *any* of them has a collision?


C = 1 - (1 - P)^N

So, for what N of objects is C = 50% ?


log(0.5) = N * log(1 - P)

N = log(0.5)/(log(1-P)

N = 7,834,327

You can have 7 Million files before you get a 50% chance of any one of them having a collision.

For a more reasonable number like N = 100k , what's the chance of collision?


C = "1 - (1 - 8.847564*(10^-8))^100000" = 0.00881 = 0.881 %

These probabilities are very low, and I have been pessimistic so they're even lower, but perhaps they are too high. On any given project, an 0.881% chance of a problem is probably okay, but for me I have to care about what's the chance that *any customer* has a problem, which puts me closer to the 7 M number of files and means it is likely I would see one problem. Of course a collision is not a disaster. You just tell Oodle to refresh everything manually, and the chance is that only happens once to anybody ever in the next few years.

BTW we used CRC32 to hash all the strings in Oddworld : Stranger's Wrath. We had around 100k unique strings which is pretty close to the breaking point for CRC32 (unlike the above case, they all had to be unique against each other). Everything was fine until near the end of the project when we got a few name collisions. Oh noes! Disaster! Not really. I just changed one of the names from "blue_rock.tga" to "blue_rock_1.tga" and it no longer collided. Yep, that was a crazy tough problem. (of course I can't use solutions like that as a library designer).

11/26/2008

11-26-08 - Oodle

OMG I finally got hot loading sort of working with bundles and override bundles. So you can build a package of a bunch of files. The game runs and grabs the package that pages in. You touch some file, the game sees the change, that causes it to make a single-file-bundle for the changed file, it loads in the new bundle, and then patches the resource table so that the new version is used instead of the old one in the package. Then the Game Dictionary is updated so that future runs will also load the new version.

Phew. It works but it's fragile and I'm not happy with how careful you have to be. There are lots of ways it can fail. For example, the packages all load one by one asynchronously. If you fire the big package load first, it may come in and get hooked up before the override package. Then if you hook up your game resources, you have hooked up to the old (wrong) data. You can prevent this by doing a call on each resource that you get to say "is this the right version of this resource" before hooking up to it. But currently there's just a ton of stuff like that you have to be aware of and make sure to do the right call, etc. etc.

I keep coming back to the problem that I need to know whether I have the right version of the file in a package. There are three options I can see :

1. Fuck it. Do nothing. This requires the game to load the right versions all the time. Clients could, for example, always just work unpackaged while editing, then when they make paging packages they would simply not be editable or incremental-linkable in a nice way. This is not a happy solution but it "works" in the sense that it is *reliable* about not working.

2. Use mod times. This works fine and is easy and fast *as long as you only touch and make files on your local machine*. Which would fail for major game devs since they tend to compile files on some kind of build server, or check in and share compiled files. But if you think about the way programmers work - we don't ever check in our .OBJ files, we just check in sources and everybody has to rebuild on their local machine. If you make your artists do the same thing - only share source files and always local rebuild - OR sync to a full compiled set from the server and only run compiled, but never try to mix local changes with server compiles - then it works fine.

3. Use CRC's. This is the most robust and reliable. The packages store the CRC of the files they were made from. If the source file on disk has a different CRC, we assume that the source file is better. (we don't try to use mod times to tell who's newer because of the server-client mod time problem). If the CRC is different we repackage using the local source file. Then when we load we always prefer packages that have content that matches the local CRC. This works and is stable and good and all. The only problem is all the time you spend doing CRC's on files, which may or may not be a big deal. Obviously running over your whole tree and doing CRC's all the time would be ridiculous, but you can use mod time to tell when to redo the CRC, so you only incrementally redo it. Even getting all the mod times is too slow, so you can use a persistent Watcher app to cache the drive dir listing and file infos.

The "CRC" method is not necessarilly "right" in the sense that it doesn't load the newest version of the content, but it is reliable, it always does the same thing - it loads content that corresponds to the source version on the user's disk.

BTW this last thing is something the OS should really do but totally fails on. With the amount of memory we have now, the OS should just keep a listing of every dir you've ever visited in memory at all times. Even if you visit every dir on the disk it would only be 100 Megs. You could cap it at 64 MB or something and LRU, and obviously having 64 MB of dir listings is going to make your dir access instant all the time. I might just write an app to do this myself.

I'm kind of tempted to offer #1 and #3 to clients and not offer #2. I don't want to offer any features that sort of work or will have weird behavior that looks like "bugs".

Also BTW yes of course there will also just be a mode to load what's in the packages and not even look at what's newest or anything. This is what the real shipping game will use of course, once you have your final packages set up it just loads them and assumes you made them right.

11/25/2008

11-24-08 - Lua Coroutines

.. are cool. Coroutines is the awesome way to do game logic that takes time. The script looks like :

PutRightFootIn();
PutRightFootOut();
if ( GetState(Friends) == dancing )
{
    DoHokeyPokey();
    TurnAround();
}
else
{
    LookAt( Friends );
    speech = StartSpeech("Screw you guys");
    FlipOff();
    WaitFor(speech);
    Go( Gome );
}

These things take various amounts of time and actually yield back execution to the game while that happens. Each NPC or game object has its own coroutine which the game basically just "pings". The Lua code really only ever runs a few lines, all it does is pick a new task and make the object do that task, and then it yields again. (obviously if it was just a static list like the example above you could do that easily other ways, the good thing about script is it can do conditionals and make decisions as it goes).

Coco lets you do a real Lua yield & resume from C calls, which lets you build your "threading" into your game's helper functions rather than relying on your designers to write scripts that yield correctly. LuaJIT (at the Coco link) is cool too.

I don't really like the Lua syntax, and I also personally hate untyped languages. (I think the right thing for scripting style languages is "weakly typed", that is all numeric types convert silenty and thing do their ToString and such, but I like to just get a compile error when I try to pass an alligator to the factorial function. Also having explicit types of "location" and "NPC" and such would be nice.) But having an off the shelf scripting language that does the coroutine thing is a pretty big win.

I generally think of simple game AI as being in 2 (or more) parts. The 2 main parts are the "brain" (or reactions) and the "job" (or task). The job/task is what the guy is currently doing and involves a plan over time and a sequence of actions. This is best done with a coroutine of some kind. The brain/reaction part is just a tick every frame that looks around and maybe changes the current task. eg. it's stuff like "saw the player, go into attack job". That can be well implemented as a regular Lua function call that the game just calls every frame, it should just do stuff that doesn't take any time and return immediately.

11/24/2008

11-23-08 - Hashing & Cache Line Size

A while ago I wrote about hash tables and reprobing and cache sizes. I mentioned Cliff Click's nonblocking hash table which is sort of interesting to me in theory but also scary and not something I really need at the moment. Anyhoo I was looking at : Some dude's C version of the lock free hash table and I noticed a little thing that's relevant to my old discussion.

He explicitly uses the cache line size and does +1 linear stepping along the cache line, then does a complex reprobe by rehashing, then does +1 again while in the cache line, then rehashes, etc. That seems like a pretty reasonable thing to do. I guess I considered that at the time, but didn't want to do it because it requires you to explicitly know the cache line size of the machine you're on. It literally requires a hard coded :

NumLinearSteps = CACHE_LINE_SIZE / sizeof(table_entry);

Some other : blogs about the lock free hash have extensive link lists.

11/22/2008

11-22-08 - Rasterization

I just found this pretty sweet introduction article on barycentric rasterization from 2004. It's not super advanced, but it starts at the beginning and works through things and is very readable. There are some dumb things in the block checking, so if you care go to the last page and see the posts by "rarefluid".

BTW the "edge equations" are really 2d plane equations (edge cross products). Checking just the edge planes against 8x8 blocks is only a rough quick reject. You can have blocks that are right outside of one vertex at an acute corner, and those blocks are "inside" all the planes but outside of the triangle. The code they have posted also checks against the bounding box of the whole triangle which largely fixes this case. At most they will consider one extra 8x8 block which doesn't actually contain any pixels.

(it's also really not yet a full barycentric rasterizer, he's just doing the edge tests that way; from his other posts I figure he's doing interpolation using the normal homogenous way, but if you're doing the edge-tests like that then you should just go ahead and do your interpolation barycentric too).

This kind of block-based barycentric rasterizer is very similar to what hardware does. One of the nice things about it is the blocks can easily be dispatched to microthreads to rasterize in parallel, and the blocks are natural quanta to check against a coarse Z buffer.

The old Olano-Greer paper about homogenous coordinate rasterization is now online in HTML. Some other random junk I found that I think is just junk : Reducing divisions for polygon mappers & Triangle Setup .

This blog about Software occlusion culling is literally a blast from the past. As I've often suggested, if you care about that stuff, the SurRender Umbra technical manual is still the godly bible on all things occlusion. (or you can read my ancient article on it ). But I also still think that object-based occlusion like that is just a bad idea.

Coarse Z that's updated by the rasterizer is however a good thing. Doing your own on top of what the card does is pretty lame though. This is yet another awesome win from Larrabee. If we/they do a coarse Z buffer, it can get used by the CPUs to do whole-object rejections, or whole-triangle rejections, or macroblock rejections.

Apparently the guy who wrote that top article is Nicolas Capens ; he wrote "swShader" which was an open source DX9 software rasterizer, which got taken down and is now a commercial product (which was a silly thing to do, of course any customer would rather buy Pixomatic !!). I learned this from a random flame war he got in. Don't you love the internet?

11/21/2008

11-21-08 - DXTC Part 4

So I finally implemented the end point lsqr fit from indeces thing that Simon does for myself. One thing immediately fell out - doing just 4 means and then end point fit pretty much equals the best mode of Squish. That's very cool because it's quite fast. Note that this is not doing all the searches of all possible clusterings - I just pick one clustering from 4 means and then optimize the end points for those indeces. (when I do 4 means I actually try a few different ways as I mentioned previously, I try all the permutations of putting the 4 means on the 4 palette entries, which is 4!/2 ways = 12 ways, but then I only optimize the best one of those, so it's still very fast).

One thing I noticed is that the lsqr fit really doesn't do much other than shift an end point by one. That is, the end points are in 565 already, you can do this nice optimization in floats, but when you quantize back to 565 you pretty much always hit the point you started with or at most a step of 1.

So the new "CB" modes are :

CB 1 = just 4 means then lsqr fit , faster than Squish, a bit slower than ryg. Quality is roughly competitive with Squish, but they have different strengths and weakness, so taking the best of the two might be reasonable. Squish never beats "CB 1" by a lot, but "CB 1" kills it on the weird "frymire" image.

CB 2 = CB 1 followed by simulated annealing.

CB 3 = Very slow heavy optimization. This is an attempt to see what "optimal" would get you. It tries "CB 2" and then it also tries using all 16 colors in the block as end points, so 16*16 trials, does the lsqr optimization on each trial, and then anneals the best of those. There are still various other things to try that might find a better result, but this is already pretty crazy slow. This is too slow to use in real production, the idea is simply to get an idea of how close "CB 2" is to optimal. Of course "CB 3" could still be far off optimal, I'm only conjecturing that it's close.

One of the interesting things to look at is the curve of diminishing returns from CB1 to 2 to 3. In most cases there's only a small improvement from 2 to 3, but there are exceptions, mainly in the weird degenerate images. kodim02 is one case (this is a photo but it's almost all red), and frymire of course. That meets expectations. On noisy natural images, the cluster of colors is pretty close to Gaussian noise, which works well with the PCA single line fit and the least-squares contiuous distribution approximation. On weird images with degenerate cases there can be stranger optimal solutions (for example : putting one of the color end points outside of the volume of original colors, so that one of the interpolated 1/3 colors can hit a certain value more precisely).

ADDENDUM : you might validly object and ask why the annealing is not getting closer to the global optimum. There are some approximations in the annealing that are hurting. For one thing I only try wiggling the ends by 1 step in 565. Then I don't really run it for very long, so it doesn't have a chance to make big walks and get to really different solutions. All it can really do is local optimizations with small steps to tunnel past local barriers to find better minima - it's not trying huge steps to other parts of the solution space. Theoretically if I ran a much longer annealing schedule with more time spent at higher temperatures it would do a better job of finding the global minimum. But I'm happy with this approach - the annealing is just an improved local optimization that steps bast small barriers, and to find drastically different global solutions I have to seed the trial differently.

The new numbers : (RMSE per pixel)

file CB 1 CB 2 CB 3 Squish opt Squish ryg D3DX8 FastDXT
kodim01.bmp 8.447 8.3145 8.2678 8.2829 8.3553 8.9185 9.8466 9.9565
kodim02.bmp 5.6492 5.4759 5.2864 6.1079 6.2876 6.8011 7.4308 8.456
kodim03.bmp 4.7533 4.6776 4.6591 4.7869 4.9181 5.398 6.094 6.4839
kodim04.bmp 5.5234 5.4286 5.3967 5.6978 5.8116 6.3424 7.1032 7.3189
kodim05.bmp 9.7619 9.6171 9.5654 9.6493 9.7223 10.2522 11.273 12.0156
kodim06.bmp 7.2524 7.1395 7.1086 7.15 7.2171 7.6423 8.5195 8.6202
kodim07.bmp 5.7557 5.6602 5.634 5.784 5.8834 6.3181 7.2182 7.372
kodim08.bmp 10.3879 10.2587 10.2056 10.2401 10.3212 10.8534 11.8703 12.2668
kodim09.bmp 5.3242 5.2477 5.2247 5.2935 5.3659 5.7315 6.5332 6.6716
kodim10.bmp 5.2564 5.1818 5.1657 5.2478 5.3366 5.7089 6.4601 6.4592
kodim11.bmp 6.7614 6.6503 6.6139 6.731 6.8206 7.3099 8.1056 8.2492
kodim12.bmp 4.8159 4.747 4.7308 4.7968 4.8718 5.342 6.005 6.0748
kodim13.bmp 11.0183 10.8489 10.7894 10.8684 10.9428 11.6049 12.7139 12.9978
kodim14.bmp 8.4325 8.3105 8.2723 8.3062 8.3883 8.8656 9.896 10.8481
kodim15.bmp 5.6871 5.5891 5.548 5.8304 5.9525 6.3297 7.3085 7.4932
kodim16.bmp 5.1351 5.0439 5.0274 5.065 5.1629 5.5526 6.3361 6.1592
kodim17.bmp 5.5999 5.5146 5.4976 5.509 5.6127 6.0357 6.7395 6.8989
kodim18.bmp 8.1345 8.0103 7.9791 7.9924 8.0897 8.6925 9.5357 9.7857
kodim19.bmp 6.6903 6.5979 6.5645 6.5762 6.652 7.2684 7.9229 8.0096
kodim20.bmp 5.4532 5.3825 5.3582 5.4568 5.5303 5.9087 6.4878 6.8629
kodim21.bmp 7.2207 7.1046 7.0718 7.1351 7.2045 7.6764 8.4703 8.6508
kodim22.bmp 6.52 6.4191 6.3933 6.4348 6.5127 7.0705 8.0046 7.9488
kodim23.bmp 4.9599 4.8899 4.8722 4.9063 5.0098 5.3789 6.3057 6.888
kodim24.bmp 8.5761 8.4633 8.4226 8.4299 8.5274 8.9206 9.9389 10.5156
clegg.bmp 14.8934 14.8017 14.6102 14.9736 15.2566 15.7163 21.471 32.7192
FRYMIRE.bmp 7.4898 7.3461 6.0851 10.7105 12.541 12.681 16.7308 28.9283
LENA.bmp 7.1286 7.0286 6.9928 7.1432 7.2346 7.6053 8.742 9.5143
MONARCH.bmp 6.6617 6.5616 6.5281 6.5567 6.6292 7.0313 8.1053 8.6993
PEPPERS.bmp 6.0418 5.9523 5.8026 6.4036 6.5208 6.9006 8.1855 8.8893
SAIL.bmp 8.4339 8.3077 8.2665 8.3254 8.3903 8.9823 9.7838 10.5673
SERRANO.bmp 6.7347 6.2445 5.9454 6.3524 6.757 7.0722 9.0549 18.3631
TULIPS.bmp 7.6491 7.5406 7.5065 7.5805 7.656 8.0101 9.3817 10.5873
lena512ggg.bmp 4.8961 4.8395 4.8241 4.8426 4.915 5.1986 6.0059 5.5247
lena512pink.bmp 4.6736 4.5922 4.5767 4.5878 4.6726 5.0987 5.8064 5.838
lena512pink0g.bmp 3.7992 3.7523 3.7455 3.7572 3.8058 4.2756 5.0732 4.8933
linear_ramp1.BMP 1.5607 1.348 1.3513 1.4035 2.1243 2.0939 2.6317 3.981
linear_ramp2.BMP 1.5097 1.2772 1.2772 1.3427 2.3049 1.9306 2.5396 4.0756
orange_purple.BMP 2.9842 2.9048 2.9074 2.9125 3.0685 3.2684 4.4123 7.937
pink_green.BMP 3.2837 3.2041 3.2031 3.2121 3.3679 3.7949 4.5127 7.3481
sum : 250.8565 246.2747 243.2776 252.3828 259.7416 275.5826 318.5562 370.8691

old rants