11/28/2008

11-28-08 - Optimization

There's a great advantage to just blindly optimizing (as long as it doesn't severely cost you in terms of clarity or maintainability or simplicity). It lets you stop worrying. It frees up head space. Say you do some really inefficient string class and then you use it all over your code. Every time you use it you have to wonder "is this okay? am I getting killed by this inefficient class?" and then you measure it, and no, it's fine. Then a month later you worry again. Then a month later you worry again. It gives you great peace of mind to know that all your base components are great. It lets you worry about the real issue, which is the higher level algorithm. (note that micro-optimizing the higher level algorithm prematurely is just always bad).

11-28-08 - Chance of CRC being bad
P = 1 - e^( -20*19 / (2*2^31) ) = 8.847564e-08
Now if you have N names, what's the chance that *any* of them has a collision?
C = 1 - (1 - P)^N
So, for what N of objects is C = 50% ?
log(0.5) = N * log(1 - P)
N = log(0.5) / log(1 - P)
N = 7,834,327
You can have 7 Million files before you get a 50% chance of any one of them having a collision.
For a more reasonable number like N = 100k , what's the chance of collision?
C = 1 - (1 - 8.847564e-08)^100000 = 0.00881 = 0.881%
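These numbers are easy to check. Here's a quick sketch of the three formulas above (the 2^31 hash space and the batch size of 20 match the pessimistic assumptions used here):

```cpp
#include <cmath>

// Birthday-style estimate : chance that a batch of k names has any
// pairwise collision in a hash with 'space' effective values.
double BatchCollisionChance(double k, double space)
{
    return 1.0 - std::exp(-k * (k - 1.0) / (2.0 * space));
}

// Chance that *any* of N independent batches has a collision.
double AnyCollisionChance(double P, double N)
{
    return 1.0 - std::pow(1.0 - P, N);
}

// N at which AnyCollisionChance first reaches a target probability C.
double CountForChance(double P, double C)
{
    return std::log(1.0 - C) / std::log(1.0 - P);
}
```

Plugging in k = 20 and space = 2^31 reproduces the P, N and C figures above.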
These probabilities are very low, and I have been pessimistic so they're even lower, but perhaps they are too high. On any given project, a 0.881% chance of a problem is probably okay, but I have to care about the chance that *any customer* has a problem, which puts me closer to the 7 M number of files and means it is likely I would see one problem. Of course a collision is not a disaster. You just tell Oodle to refresh everything manually, and chances are that only happens once to anybody ever in the next few years.
BTW we used CRC32 to hash all the strings in Oddworld : Stranger's Wrath. We had around 100k unique strings which is pretty close to the breaking point for CRC32 (unlike the above case, they all had to be unique against each other). Everything was fine until near the end of the project when we got a few name collisions. Oh noes! Disaster! Not really. I just changed one of the names from "blue_rock.tga" to "blue_rock_1.tga" and it no longer collided. Yep, that was a crazy tough problem. (of course I can't use solutions like that as a library designer).
11/26/2008
11-26-08 - Oodle
Phew. It works but it's fragile and I'm not happy with how careful you have to be. There are lots of ways it can fail. For example, the packages all load one by one asynchronously. If you fire the big package load first, it may come in and get hooked up before the override package. Then if you hook up your game resources, you have hooked up to the old (wrong) data. You can prevent this by doing a call on each resource that you get to say "is this the right version of this resource" before hooking up to it. But currently there's just a ton of stuff like that you have to be aware of and make sure to do the right call, etc. etc.
I keep coming back to the problem that I need to know whether I have the right version of the file in a package. There are three options I can see :
1. Fuck it. Do nothing. This requires the game to load the right versions all the time. Clients could, for example, always just work unpackaged while editing, then when they make paging packages they would simply not be editable or incremental-linkable in a nice way. This is not a happy solution but it "works" in the sense that it is *reliable* about not working.
2. Use mod times. This works fine and is easy and fast *as long as you only touch and make files on your local machine*. Which would fail for major game devs since they tend to compile files on some kind of build server, or check in and share compiled files. But if you think about the way programmers work - we don't ever check in our .OBJ files, we just check in sources and everybody has to rebuild on their local machine. If you make your artists do the same thing - only share source files and always local rebuild - OR sync to a full compiled set from the server and only run compiled, but never try to mix local changes with server compiles - then it works fine.
3. Use CRC's. This is the most robust and reliable. The packages store the CRC of the files they were made from. If the source file on disk has a different CRC, we assume that the source file is better. (we don't try to use mod times to tell who's newer because of the server-client mod time problem). If the CRC is different we repackage using the local source file. Then when we load we always prefer packages that have content that matches the local CRC. This works and is stable and good and all. The only problem is all the time you spend doing CRC's on files, which may or may not be a big deal. Obviously running over your whole tree and doing CRC's all the time would be ridiculous, but you can use mod time to tell when to redo the CRC, so you only incrementally redo it. Even getting all the mod times is too slow, so you can use a persistent Watcher app to cache the drive dir listing and file infos.
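A minimal sketch of the incremental part of option 3 (names here are made up for illustration, and the cache is in-memory only; a real version would persist it and get mod times from the watcher app):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Cache of (mod time, CRC) per source file. We only redo the CRC of a
// file when its mod time changed since we last hashed it.
struct CachedCrc { int64_t modTime; uint32_t crc; };
static std::map<std::string, CachedCrc> g_crcCache;

// true if we must recompute the CRC (new file, or mod time changed)
bool NeedsRecrc(const std::string & path, int64_t modTime)
{
    auto it = g_crcCache.find(path);
    return it == g_crcCache.end() || it->second.modTime != modTime;
}

// record the CRC we just computed for this (path, mod time)
void StoreCrc(const std::string & path, int64_t modTime, uint32_t crc)
{
    g_crcCache[path] = { modTime, crc };
}
```

Note the mod time here is only used as a cheap change detector for the local cache, never to compare across machines, which sidesteps the server-client mod time problem.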
The "CRC" method is not necessarily "right" in the sense that it doesn't load the newest version of the content, but it is reliable, it always does the same thing - it loads content that corresponds to the source version on the user's disk.
BTW this last thing is something the OS should really do but totally fails on. With the amount of memory we have now, the OS should just keep a listing of every dir you've ever visited in memory at all times. Even if you visit every dir on the disk it would only be 100 Megs. You could cap it at 64 MB or something and LRU, and obviously having 64 MB of dir listings is going to make your dir access instant all the time. I might just write an app to do this myself.
I'm kind of tempted to offer #1 and #3 to clients and not offer #2. I don't want to offer any features that sort of work or will have weird behavior that looks like "bugs".
Also BTW yes of course there will also just be a mode to load what's in the packages and not even look at what's newest or anything. This is what the real shipping game will use of course, once you have your final packages set up it just loads them and assumes you made them right.
11/25/2008
11-24-08 - Lua Coroutines
PutRightFootIn();
PutRightFootOut();
if ( GetState(Friends) == dancing )
{
    DoHokeyPokey();
    TurnAround();
}
else
{
    LookAt( Friends );
    speech = StartSpeech("Screw you guys");
    FlipOff();
    WaitFor(speech);
    Go( Home );
}
These things take various amounts of time and actually yield back execution to the game while that happens. Each NPC or game object has its own coroutine which the game basically just "pings". The Lua code really only ever runs a few lines, all it does is pick a new task and make the object do that task, and then it yields again. (obviously if it was just a static list like the example above you could do that easily other ways, the good thing about script is it can do conditionals and make decisions as it goes).
Coco lets you do a real Lua yield & resume from C calls, which lets you build your "threading" into your game's helper functions rather than relying on your designers to write scripts that yield correctly. LuaJIT (at the Coco link) is cool too.
I don't really like the Lua syntax, and I also personally hate untyped languages. (I think the right thing for scripting style languages is "weakly typed", that is all numeric types convert silently and things do their ToString and such, but I like to just get a compile error when I try to pass an alligator to the factorial function. Also having explicit types of "location" and "NPC" and such would be nice.) But having an off the shelf scripting language that does the coroutine thing is a pretty big win.
I generally think of simple game AI as being in 2 (or more) parts. The 2 main parts are the "brain" (or reactions) and the "job" (or task). The job/task is what the guy is currently doing and involves a plan over time and a sequence of actions. This is best done with a coroutine of some kind. The brain/reaction part is just a tick every frame that looks around and maybe changes the current task. eg. it's stuff like "saw the player, go into attack job". That can be well implemented as a regular Lua function call that the game just calls every frame, it should just do stuff that doesn't take any time and return immediately.
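A hypothetical sketch of that two-part split, with the task's coroutine stood in for by an explicit step counter (all names here are made up for illustration):

```cpp
// The "job" is a little state machine advanced one step per tick
// (standing in for a real coroutine), and the "brain" is a plain
// function that runs every frame and may swap the current job.
enum class Task { Patrol, Attack, Idle };

struct NPC {
    Task task = Task::Patrol;
    int  step = 0;          // where the task's "coroutine" is paused
    bool seesPlayer = false;
};

// brain/reaction : ticks every frame, takes no time, may change the task
void BrainTick(NPC & npc)
{
    if (npc.seesPlayer && npc.task != Task::Attack) {
        npc.task = Task::Attack;   // "saw the player, go into attack job"
        npc.step = 0;
    }
}

// job/task : does one small piece of work, then "yields" until next tick
void TaskTick(NPC & npc)
{
    npc.step++;                    // a real version runs task logic here
}
```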
11/24/2008
11-23-08 - Hashing & Cache Line Size
He explicitly uses the cache line size and does +1 linear stepping along the cache line, then does a complex reprobe by rehashing, then does +1 again while in the cache line, then rehashes, etc. That seems like a pretty reasonable thing to do. I guess I considered that at the time, but didn't want to do it because it requires you to explicitly know the cache line size of the machine you're on. It literally requires a hard coded :
NumLinearSteps = CACHE_LINE_SIZE / sizeof(table_entry);
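As a sketch, the resulting probe sequence looks something like this (the rehash step here is a cheap made-up stand-in, not the actual reprobe from the article, and the 64-byte cache line is an assumption):

```cpp
#include <cstddef>

const size_t CACHE_LINE_SIZE = 64;

struct table_entry { unsigned key; unsigned value; };

// +1 linear steps that fit in one cache line's worth of entries
const size_t NumLinearSteps = CACHE_LINE_SIZE / sizeof(table_entry);

// index of the i'th probe for a given hash; table size is a power of two
size_t ProbeIndex(size_t hash, size_t i, size_t tableMask)
{
    size_t reprobes = i / NumLinearSteps;     // how many rehashes so far
    size_t linear   = i % NumLinearSteps;     // +1 stepping inside the line
    size_t h = hash + reprobes * 0x9E3779B9u; // stand-in for a real rehash
    return (h + linear) & tableMask;
}
```

The first NumLinearSteps probes stay adjacent (one cache line), then the sequence jumps to a fresh spot and walks again.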
Some other blogs about the lock free hash have extensive lists of links.
11/22/2008
11-22-08 - Rasterization
BTW the "edge equations" are really 2d plane equations (edge cross products). Checking just the edge planes against 8x8 blocks is only a rough quick reject. You can have blocks that are right outside of one vertex at an acute corner, and those blocks are "inside" all the planes but outside of the triangle. The code they have posted also checks against the bounding box of the whole triangle which largely fixes this case. At most they will consider one extra 8x8 block which doesn't actually contain any pixels.
(it's also really not yet a full barycentric rasterizer, he's just doing the edge tests that way; from his other posts I figure he's doing interpolation using the normal homogenous way, but if you're doing the edge-tests like that then you should just go ahead and do your interpolation barycentric too).
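A sketch of that rough quick reject, testing an edge equation at the block corner where it is largest (the conventions here - inside is E >= 0, 8x8 pixel blocks - are assumptions for illustration):

```cpp
// A 2d edge equation E(x,y) = A*x + B*y + C, positive on the inside.
struct Edge { float A, B, C; };

// E at the most-inside corner of the 8x8 block whose top-left is (bx,by)
float EdgeMaxOverBlock(const Edge & e, float bx, float by)
{
    float x = (e.A >= 0) ? bx + 7 : bx;   // corner maximizing the A*x term
    float y = (e.B >= 0) ? by + 7 : by;   // corner maximizing the B*y term
    return e.A * x + e.B * y + e.C;
}

// block is entirely outside the triangle if even its best corner is out
bool BlockOutsideEdge(const Edge & e, float bx, float by)
{
    return EdgeMaxOverBlock(e, bx, by) < 0.0f;
}
```

As noted above, a block can pass all three edge tests and still contain no pixels (outside an acute corner), which is why the triangle bounding box test is also needed.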
This kind of block-based barycentric rasterizer is very similar to what hardware does. One of the nice things about it is the blocks can easily be dispatched to microthreads to rasterize in parallel, and the blocks are natural quanta to check against a coarse Z buffer.
The old Olano-Greer paper about homogenous coordinate rasterization is now online in HTML. Some other random junk I found that I think is just junk : Reducing divisions for polygon mappers & Triangle Setup .
This blog about Software occlusion culling is literally a blast from the past. As I've often suggested, if you care about that stuff, the SurRender Umbra technical manual is still the godly bible on all things occlusion. (or you can read my ancient article on it ). But I also still think that object-based occlusion like that is just a bad idea.
Coarse Z that's updated by the rasterizer is however a good thing. Doing your own on top of what the card does is pretty lame though. This is yet another awesome win from Larrabee. If we/they do a coarse Z buffer, it can get used by the CPUs to do whole-object rejections, or whole-triangle rejections, or macroblock rejections.
Apparently the guy who wrote that top article is Nicolas Capens ; he wrote "swShader" which was an open source DX9 software rasterizer, which got taken down and is now a commercial product (which was a silly thing to do, of course any customer would rather buy Pixomatic !!). I learned this from a random flame war he got in. Don't you love the internet?
11/21/2008
11-21-08 - DXTC Part 4
One thing I noticed is that the lsqr fit really doesn't do much other than shift an end point by one. That is, the end points are in 565 already, you can do this nice optimization in floats, but when you quantize back to 565 you pretty much always hit the point you started with or at most a step of 1.
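For concreteness, this is the 5:6:5 space the end points live in (standard DXT1 endpoint packing; note one quantization step in a 5-bit channel is about 8 levels of 8-bit precision, which is why the float optimum lands at most one step away):

```cpp
#include <cstdint>

// Pack 8-bit RGB down to the 5:6:5 end point format.
uint16_t To565(int r, int g, int b)   // r,g,b in [0,255]
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Expand a 5:6:5 end point back to [0,255] with bit replication via scaling.
void From565(uint16_t c, int & r, int & g, int & b)
{
    r = ((c >> 11) & 31) * 255 / 31;
    g = ((c >>  5) & 63) * 255 / 63;
    b = ( c        & 31) * 255 / 31;
}
```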
So the new "CB" modes are :
CB 1 = just 4 means then lsqr fit , faster than Squish, a bit slower than ryg. Quality is roughly competitive with Squish, but they have different strengths and weaknesses, so taking the best of the two might be reasonable. Squish never beats "CB 1" by a lot, but "CB 1" kills it on the weird "frymire" image.
CB 2 = CB 1 followed by simulated annealing.
CB 3 = Very slow heavy optimization. This is an attempt to see what "optimal" would get you. It tries "CB 2" and then it also tries using all 16 colors in the block as end points, so 16*16 trials, does the lsqr optimization on each trial, and then anneals the best of those. There are still various other things to try that might find a better result, but this is already pretty crazy slow. This is too slow to use in real production, the idea is simply to get an idea of how close "CB 2" is to optimal. Of course "CB 3" could still be far off optimal, I'm only conjecturing that it's close.
One of the interesting things to look at is the curve of diminishing returns from CB1 to 2 to 3. In most cases there's only a small improvement from 2 to 3, but there are exceptions, mainly in the weird degenerate images. kodim02 is one case (this is a photo but it's almost all red), and frymire of course. That meets expectations. On noisy natural images, the cluster of colors is pretty close to Gaussian noise, which works well with the PCA single line fit and the least-squares continuous distribution approximation. On weird images with degenerate cases there can be stranger optimal solutions (for example : putting one of the color end points outside of the volume of original colors, so that one of the interpolated 1/3 colors can hit a certain value more precisely).
ADDENDUM : you might validly object and ask why the annealing is not getting closer to the global optimum. There are some approximations in the annealing that are hurting. For one thing I only try wiggling the ends by 1 step in 565. Then I don't really run it for very long, so it doesn't have a chance to make big walks and get to really different solutions. All it can really do is local optimizations with small steps to tunnel past local barriers to find better minima - it's not trying huge steps to other parts of the solution space. Theoretically if I ran a much longer annealing schedule with more time spent at higher temperatures it would do a better job of finding the global minimum. But I'm happy with this approach - the annealing is just an improved local optimization that steps past small barriers, and to find drastically different global solutions I have to seed the trial differently.
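For reference, this is the standard accept rule that kind of annealing uses (a generic sketch, not the actual Oodle code): always take improvements, and take uphill steps with a probability that shrinks as the temperature drops, which is exactly what lets it tunnel past small barriers but not reach distant parts of the solution space on a short schedule.

```cpp
#include <cmath>

// Metropolis-style accept test for simulated annealing.
// rand01 is a uniform random number in [0,1).
bool AnnealAccept(double oldErr, double newErr, double temperature,
                  double rand01)
{
    if (newErr <= oldErr) return true;                 // improvement : take it
    double p = std::exp(-(newErr - oldErr) / temperature);
    return rand01 < p;                                 // maybe step uphill
}
```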
The new numbers : (RMSE per pixel)
| file | CB 1 | CB 2 | CB 3 | Squish opt | Squish | ryg | D3DX8 | FastDXT |
| kodim01.bmp | 8.447 | 8.3145 | 8.2678 | 8.2829 | 8.3553 | 8.9185 | 9.8466 | 9.9565 |
| kodim02.bmp | 5.6492 | 5.4759 | 5.2864 | 6.1079 | 6.2876 | 6.8011 | 7.4308 | 8.456 |
| kodim03.bmp | 4.7533 | 4.6776 | 4.6591 | 4.7869 | 4.9181 | 5.398 | 6.094 | 6.4839 |
| kodim04.bmp | 5.5234 | 5.4286 | 5.3967 | 5.6978 | 5.8116 | 6.3424 | 7.1032 | 7.3189 |
| kodim05.bmp | 9.7619 | 9.6171 | 9.5654 | 9.6493 | 9.7223 | 10.2522 | 11.273 | 12.0156 |
| kodim06.bmp | 7.2524 | 7.1395 | 7.1086 | 7.15 | 7.2171 | 7.6423 | 8.5195 | 8.6202 |
| kodim07.bmp | 5.7557 | 5.6602 | 5.634 | 5.784 | 5.8834 | 6.3181 | 7.2182 | 7.372 |
| kodim08.bmp | 10.3879 | 10.2587 | 10.2056 | 10.2401 | 10.3212 | 10.8534 | 11.8703 | 12.2668 |
| kodim09.bmp | 5.3242 | 5.2477 | 5.2247 | 5.2935 | 5.3659 | 5.7315 | 6.5332 | 6.6716 |
| kodim10.bmp | 5.2564 | 5.1818 | 5.1657 | 5.2478 | 5.3366 | 5.7089 | 6.4601 | 6.4592 |
| kodim11.bmp | 6.7614 | 6.6503 | 6.6139 | 6.731 | 6.8206 | 7.3099 | 8.1056 | 8.2492 |
| kodim12.bmp | 4.8159 | 4.747 | 4.7308 | 4.7968 | 4.8718 | 5.342 | 6.005 | 6.0748 |
| kodim13.bmp | 11.0183 | 10.8489 | 10.7894 | 10.8684 | 10.9428 | 11.6049 | 12.7139 | 12.9978 |
| kodim14.bmp | 8.4325 | 8.3105 | 8.2723 | 8.3062 | 8.3883 | 8.8656 | 9.896 | 10.8481 |
| kodim15.bmp | 5.6871 | 5.5891 | 5.548 | 5.8304 | 5.9525 | 6.3297 | 7.3085 | 7.4932 |
| kodim16.bmp | 5.1351 | 5.0439 | 5.0274 | 5.065 | 5.1629 | 5.5526 | 6.3361 | 6.1592 |
| kodim17.bmp | 5.5999 | 5.5146 | 5.4976 | 5.509 | 5.6127 | 6.0357 | 6.7395 | 6.8989 |
| kodim18.bmp | 8.1345 | 8.0103 | 7.9791 | 7.9924 | 8.0897 | 8.6925 | 9.5357 | 9.7857 |
| kodim19.bmp | 6.6903 | 6.5979 | 6.5645 | 6.5762 | 6.652 | 7.2684 | 7.9229 | 8.0096 |
| kodim20.bmp | 5.4532 | 5.3825 | 5.3582 | 5.4568 | 5.5303 | 5.9087 | 6.4878 | 6.8629 |
| kodim21.bmp | 7.2207 | 7.1046 | 7.0718 | 7.1351 | 7.2045 | 7.6764 | 8.4703 | 8.6508 |
| kodim22.bmp | 6.52 | 6.4191 | 6.3933 | 6.4348 | 6.5127 | 7.0705 | 8.0046 | 7.9488 |
| kodim23.bmp | 4.9599 | 4.8899 | 4.8722 | 4.9063 | 5.0098 | 5.3789 | 6.3057 | 6.888 |
| kodim24.bmp | 8.5761 | 8.4633 | 8.4226 | 8.4299 | 8.5274 | 8.9206 | 9.9389 | 10.5156 |
| clegg.bmp | 14.8934 | 14.8017 | 14.6102 | 14.9736 | 15.2566 | 15.7163 | 21.471 | 32.7192 |
| FRYMIRE.bmp | 7.4898 | 7.3461 | 6.0851 | 10.7105 | 12.541 | 12.681 | 16.7308 | 28.9283 |
| LENA.bmp | 7.1286 | 7.0286 | 6.9928 | 7.1432 | 7.2346 | 7.6053 | 8.742 | 9.5143 |
| MONARCH.bmp | 6.6617 | 6.5616 | 6.5281 | 6.5567 | 6.6292 | 7.0313 | 8.1053 | 8.6993 |
| PEPPERS.bmp | 6.0418 | 5.9523 | 5.8026 | 6.4036 | 6.5208 | 6.9006 | 8.1855 | 8.8893 |
| SAIL.bmp | 8.4339 | 8.3077 | 8.2665 | 8.3254 | 8.3903 | 8.9823 | 9.7838 | 10.5673 |
| SERRANO.bmp | 6.7347 | 6.2445 | 5.9454 | 6.3524 | 6.757 | 7.0722 | 9.0549 | 18.3631 |
| TULIPS.bmp | 7.6491 | 7.5406 | 7.5065 | 7.5805 | 7.656 | 8.0101 | 9.3817 | 10.5873 |
| lena512ggg.bmp | 4.8961 | 4.8395 | 4.8241 | 4.8426 | 4.915 | 5.1986 | 6.0059 | 5.5247 |
| lena512pink.bmp | 4.6736 | 4.5922 | 4.5767 | 4.5878 | 4.6726 | 5.0987 | 5.8064 | 5.838 |
| lena512pink0g.bmp | 3.7992 | 3.7523 | 3.7455 | 3.7572 | 3.8058 | 4.2756 | 5.0732 | 4.8933 |
| linear_ramp1.BMP | 1.5607 | 1.348 | 1.3513 | 1.4035 | 2.1243 | 2.0939 | 2.6317 | 3.981 |
| linear_ramp2.BMP | 1.5097 | 1.2772 | 1.2772 | 1.3427 | 2.3049 | 1.9306 | 2.5396 | 4.0756 |
| orange_purple.BMP | 2.9842 | 2.9048 | 2.9074 | 2.9125 | 3.0685 | 3.2684 | 4.4123 | 7.937 |
| pink_green.BMP | 3.2837 | 3.2041 | 3.2031 | 3.2121 | 3.3679 | 3.7949 | 4.5127 | 7.3481 |
| sum : | 250.8565 | 246.2747 | 243.2776 | 252.3828 | 259.7416 | 275.5826 | 318.5562 | 370.8691 |
11-21-08 - More Texture Compression Nonsense
There are multiple different issues here. There's providing data to the card in a way that's good for texture cache usage (DXT1 is a big winner here, especially on the RSX apparently). Another is keeping data small in memory so you can hold more. Obviously that's a bigger issue on the consoles than the PC, but you always want things smaller in memory if you can. Another issue is paging off disk quickly. Smaller data loads faster, though that's not a huge issue off hard disk, and even off DVD if you're doing something like SVT you are so dominated by seek times that file sizes may not be that big a deal. Another issue is the size of your content for transmission, either on DVD or over the net; it's nice to make your game downloadable and reasonably small.
I guess that the Xenon guys sort of encourage you to use PCT to pack your textures tiny on disk or for transmission. PCT seems to be a variant of HD-photo. MCT is I guess a DXT-recompressor, but I can't find details on it. I'm not spilling any beans that you can't find at MS Research or that end users aren't figuring out or PS3 developers are hinting at.
The "Id way" I guess is storing data on disk in JPEG, paging that, decompressing, then recompressing to DXT5-YCoCg. That has the advantage of being reasonably small on disk, which is important if you have a huge unique textured world so you have N gigs of textures. But I wonder what kind of quality they really get from that. They're using two different lossy schemes, and when you compress through two different lossy schemes the errors usually add up. I would guess that the total error from running through both compressors puts them in the ballpark of DXT1. They're using 8 bits per pixel in memory, and presumably something like 1-2 bits per pixel on disk.
Instead you could just use DXT1 , at 4 bits per pixel in memory, and do a DXT1-recompressor, which I would guess could get around 2 bits per pixel on disk. DXT1-recompressed is lower quality than JPEG, but I wonder how it compares to JPEG-then-DXT5 ?
If I ignore the specifics of the Id method or the XBox360 for the moment, the general options are :
1. Compress textures to the desired hardware-ready format (DXT1 or DXT5 or whatever) with a high quality offline compressor. Store them in memory in this format. Recompress with a shuffler/delta algorithm to make them smaller for transmission on disk, but don't take them out of hardware-ready format. One disadvantage of this is that if you have to support multiple hardware-ready texture formats on the PC you have to transmit them all or convert.
2. Compress texture on disk with a lossy scheme that's very tight and has good RMSE/size performance. Decompress on load and then recompress to hardware-ready format (or not, if you have enough video memory). One advantage is you can look at the user's machine and decide what hardware-ready format to use. A disadvantage is the realtime DXT compressors are much lower quality and even though they are very fast the decompress-recompress is still a pretty big CPU load for paging.
3. Hybrid of the two : use some very small format for internet transmission or DVD storage, but when you install to HD do the decompress and recompress to hardware-ready format then. Still has the problem that you're running two different lossy compressors which is evil, but reduces CPU load during game runs.
I don't think that having smaller data on disk is much of a win for paging performance. If you're paging any kind of reasonable fine-grained unit you're just so dominated by seek time, and hard disk throughput is really very fast (and it's all async anyway so the time doesn't hurt). For example a 256 x 256 texture at 4 bits per pixel is 32k bytes which loads in 0.32 millis at 100 MB/sec. (BTW that's a handy rule of thumb - 100k takes 1 milli). So the real interesting goals are : A) be small in memory so that you can have a lot of goodies on screen, and B) be small for transmission to reduce install time, download time, number of DVDs, etc.
One idea I've been tossing around is the idea of using lossy compressed textures in memory as a texture cache to avoid hitting disk. For example, a 256 x 256 texture at 1 bit per pixel is only 8k bytes. If you wanted to page textures in that unit size you would be ridiculously dominated by seek time. Instead you could just hold a bunch of them in memory and decompress to "page" them into usable formats. The disk throughput is comparable to the decompressor throughput, but not having seek times means you can decompress them on demand. I'm not convinced that this is actually a useful thing though.
BTW I also found this Old Epic page about the NVidia DXT1 problem which happily is fixed but I guess there are still loads of those old chips around; this might still make using DXT1 exclusively on the PC impossible. The page also has some good sample textures to kill your compressor with.
11/20/2008
11-20-08 - DXTC Part 3
Small block inherently gives up a lot of coding capability, because you aren't allowed to use any cross-block information. H264 and HDPhoto are both small block (4x4) but both make very heavy use of cross-block information for either context coding, DC prediction, or lapping. Even JPEG is not a true small block coder because it has side information in the Huffman table that captures whole-image statistics.
Fixed bitrate blocks inherently give up even more. That kills your ability to do any rate-distortion type of optimization. You can't allocate bits where they're needed. You might have images with big flat sections where you are actually wasting bits (you don't need all 64 bits for a 4x4 block), and then you have other areas that desperately need a few more bits, but you can't give them to them.
So, what if we keep ourselves constrained to the idea of a fixed size block and try to use a better coder? What is the limit on how well you can do with those constraints? I thought I'd see if I could answer that reasonably quickly.
What I made is an 8x8 pixel fixed rate coder. It has zero side information, eg. no per-image tables. (it does have about 16 constants that are used for all images). Each block is coded to a fixed bit rate. Here I'm coding to 4 bits per pixel (the same as DXT1) so that I can compare RMSE directly, which is a 32 byte block for 8x8 pixels. It also works pretty well at 24 byte blocks (which is 3 bits per pixel), or 64 bytes for high quality, etc.
This 8x8 coder does a lossless YCoCg transform and a lossy DCT. Unlike JPEG, there is no quantization, no subsampling of chroma, no huffman table, etc. Coding is via an embedded bitplane coder with zerotree-style context prediction. I haven't spent much time on this, so the coding schemes are very rough. CodeTree and CodeLinear are two different coding techniques, and neither one is ideal.
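The standard reversible "YCoCg-R" lifting form is presumably what a lossless YCoCg transform like this looks like (this is the textbook version, not necessarily the exact code in the coder):

```cpp
// Forward lossless YCoCg-R : integer lifting steps, exactly invertible.
// Co and Cg come out with an extra bit of range compared to the inputs.
void RGBtoYCoCgR(int R, int G, int B, int & Y, int & Co, int & Cg)
{
    Co = R - B;
    int t = B + (Co >> 1);
    Cg = G - t;
    Y  = t + (Cg >> 1);
}

// Inverse : undo the lifting steps in reverse order.
void YCoCgRtoRGB(int Y, int Co, int Cg, int & R, int & G, int & B)
{
    int t = Y - (Cg >> 1);
    G = Cg + t;
    B = t - (Co >> 1);
    R = B + Co;
}
```

Because the inverse recomputes the same shifted values, the round trip is bit-exact for any integer RGB.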
Obviously going to 8x8 instead of 4x4 is a big advantage, but it seems like a more reasonable size for future hardware. To really improve the coding significantly on 4x4 blocks you would have to start using something like VQ with a codebook which hardware people don't like.
In the table below you'll see that CodeTree and CodeLinear generally provide a nice improvement on the natural images, about 20%. In general they're pretty close to half way between DXTC and the full image coder "cbwave". They have a different kind of perceptual artifact when they have errors - unlike DXTC which just make things really blocky, these get the halo ringing artifacts like JPEG (it's inherent in truncating DCT's).
The new coders do really badly on the weird synthetic images from bragzone, like clegg, frymire and serrano. I'd have to fix that if I really cared about these things.
One thing that is encouraging is that this coder does *very* well on the simple synthetic images, like the "linear_ramp" and the "pink_green" and "orange_purple". I think these synthetic images are a lot like what game lightmaps are like, and the new schemes are near lossless on them.
BTW image compression for paging is sort of a whole other issue. For one thing, a per-image table is perfectly reasonable to have, and you could work on something like 32x32 blocks. But more important is that in the short term you still need to provide the image in DXT1 to the graphics hardware. So you either just have to page the data in DXT1 already, or you have to recompress it, and as we've seen here the "real-time" DXT1 recompressors are not high enough quality for ubiquitous use.
ADDENDUM I forgot but you may have noticed the "ryg" in this table is also not the same as previous "ryg" - I fixed a few of the little bugs and you can see the improvement here. It's still not competitive, I think there may be more errors in the best fit optimization portion of the code, but he's got that code so optimized and obfuscated I can't see what's going on.
Oh, BTW the "CB" in this table is different than the previous table; the one here uses 4-means instead of 2-means, seeded from the pca direction, and then I try using each of the 4 means as endpoints. It's still not quite as good as Squish, but it's closer. It does beat Squish on some of the more degenerate images at the bottom, such as "linear_ramp". It also beats Squish on artificial tests like images that contain only 2 colors. For example on linear_ramp2 without optimization, 4-means gets 1.5617 while Squish gets 2.3049 ; most of that difference goes away after annealing though.
RMSE per pixel :
| file | Squish opt | Squish | CB opt | CB | ryg | D3DX8 | FastDXT | cbwave KLTY | CodeTree | CodeLinear |
| kodim01.bmp | 8.2808 | 8.3553 | 8.3035 | 8.516 | 8.9185 | 9.8466 | 9.9565 | 2.6068 | 5.757835 | 5.659023 |
| kodim02.bmp | 6.1086 | 6.2876 | 6.1159 | 6.25 | 6.8011 | 7.4308 | 8.456 | 1.6973 | 4.131007 | 4.144241 |
| kodim03.bmp | 4.7804 | 4.9181 | 4.7953 | 4.9309 | 5.398 | 6.094 | 6.4839 | 1.3405 | 3.369018 | 3.50115 |
| kodim04.bmp | 5.6913 | 5.8116 | 5.7201 | 5.8837 | 6.3424 | 7.1032 | 7.3189 | 1.8076 | 4.254454 | 4.174228 |
| kodim05.bmp | 9.6472 | 9.7223 | 9.6766 | 9.947 | 10.2522 | 11.273 | 12.0156 | 2.9739 | 6.556041 | 6.637885 |
| kodim06.bmp | 7.1472 | 7.2171 | 7.1596 | 7.3224 | 7.6423 | 8.5195 | 8.6202 | 2.0132 | 5.013081 | 4.858232 |
| kodim07.bmp | 5.7804 | 5.8834 | 5.7925 | 5.9546 | 6.3181 | 7.2182 | 7.372 | 1.4645 | 3.76087 | 3.79437 |
| kodim08.bmp | 10.2391 | 10.3212 | 10.2865 | 10.5499 | 10.8534 | 11.8703 | 12.2668 | 3.2936 | 6.861067 | 6.927792 |
| kodim09.bmp | 5.2871 | 5.3659 | 5.3026 | 5.4236 | 5.7315 | 6.5332 | 6.6716 | 1.6269 | 3.473094 | 3.479715 |
| kodim10.bmp | 5.2415 | 5.3366 | 5.2538 | 5.3737 | 5.7089 | 6.4601 | 6.4592 | 1.7459 | 3.545115 | 3.593297 |
| kodim11.bmp | 6.7261 | 6.8206 | 6.7409 | 6.9128 | 7.3099 | 8.1056 | 8.2492 | 1.8411 | 4.906141 | 4.744971 |
| kodim12.bmp | 4.7911 | 4.8718 | 4.799 | 4.9013 | 5.342 | 6.005 | 6.0748 | 1.5161 | 3.210518 | 3.231271 |
| kodim13.bmp | 10.8676 | 10.9428 | 10.9023 | 11.2169 | 11.6049 | 12.7139 | 12.9978 | 4.1355 | 9.044009 | 8.513297 |
| kodim14.bmp | 8.3034 | 8.3883 | 8.3199 | 8.5754 | 8.8656 | 9.896 | 10.8481 | 2.4191 | 6.212482 | 6.222196 |
| kodim15.bmp | 5.8233 | 5.9525 | 5.8432 | 6.0189 | 6.3297 | 7.3085 | 7.4932 | 1.6236 | 4.3074 | 4.441998 |
| kodim16.bmp | 5.0593 | 5.1629 | 5.0595 | 5.1637 | 5.5526 | 6.3361 | 6.1592 | 1.546 | 3.476671 | 3.333637 |
| kodim17.bmp | 5.5019 | 5.6127 | 5.51 | 5.6362 | 6.0357 | 6.7395 | 6.8989 | 1.7166 | 4.125859 | 4.007367 |
| kodim18.bmp | 7.9879 | 8.0897 | 8.0034 | 8.225 | 8.6925 | 9.5357 | 9.7857 | 2.9802 | 6.743892 | 6.376692 |
| kodim19.bmp | 6.5715 | 6.652 | 6.5961 | 6.7445 | 7.2684 | 7.9229 | 8.0096 | 2.0518 | 4.45822 | 4.353687 |
| kodim20.bmp | 5.4533 | 5.5303 | 5.47 | 5.5998 | 5.9087 | 6.4878 | 6.8629 | 1.5359 | 4.190565 | 4.154571 |
| kodim21.bmp | 7.1318 | 7.2045 | 7.1493 | 7.3203 | 7.6764 | 8.4703 | 8.6508 | 2.0659 | 5.269787 | 5.05321 |
| kodim22.bmp | 6.43 | 6.5127 | 6.4444 | 6.6185 | 7.0705 | 8.0046 | 7.9488 | 2.2574 | 5.217884 | 5.142252 |
| kodim23.bmp | 4.8995 | 5.0098 | 4.906 | 5.0156 | 5.3789 | 6.3057 | 6.888 | 1.3954 | 3.20464 | 3.378545 |
| kodim24.bmp | 8.4271 | 8.5274 | 8.442 | 8.7224 | 8.9206 | 9.9389 | 10.5156 | 2.4977 | 7.618436 | 7.389021 |
| clegg.bmp | 14.9733 | 15.2566 | 15.1516 | 16.0477 | 15.7163 | 21.471 | 32.7192 | 10.5426 | 21.797655 | 25.199576 |
| FRYMIRE.bmp | 10.7184 | 12.541 | 11.9631 | 12.9719 | 12.681 | 16.7308 | 28.9283 | 6.2394 | 21.543401 | 24.225852 |
| LENA.bmp | 7.138 | 7.2346 | 7.1691 | 7.3897 | 7.6053 | 8.742 | 9.5143 | 4.288 | 7.936599 | 8.465576 |
| MONARCH.bmp | 6.5526 | 6.6292 | 6.5809 | 6.7556 | 7.0313 | 8.1053 | 8.6993 | 1.6911 | 5.880189 | 5.915117 |
| PEPPERS.bmp | 6.3966 | 6.5208 | 6.436 | 6.6482 | 6.9006 | 8.1855 | 8.8893 | 2.3022 | 6.15367 | 6.228315 |
| SAIL.bmp | 8.3233 | 8.3903 | 8.3417 | 8.5561 | 8.9823 | 9.7838 | 10.5673 | 2.9003 | 6.642762 | 6.564393 |
| SERRANO.bmp | 6.3508 | 6.757 | 6.5572 | 6.991 | 7.0722 | 9.0549 | 18.3631 | 4.6489 | 13.516339 | 16.036401 |
| TULIPS.bmp | 7.5768 | 7.656 | 7.5959 | 7.8172 | 8.0101 | 9.3817 | 10.5873 | 2.2228 | 5.963537 | 6.384049 |
| lena512ggg.bmp | 4.8352 | 4.915 | 4.8261 | 4.877 | 5.1986 | 6.0059 | 5.5247 | | 2.054319 | 2.276361 |
| lena512pink.bmp | 4.5786 | 4.6726 | 4.581 | 4.6863 | 5.0987 | 5.8064 | 5.838 | | 3.653436 | 3.815336 |
| lena512pink0g.bmp | 3.7476 | 3.8058 | 3.7489 | 3.8034 | 4.2756 | 5.0732 | 4.8933 | | 4.091045 | 5.587278 |
| linear_ramp1.BMP | 1.4045 | 2.1243 | 1.3741 | 1.6169 | 2.0939 | 2.6317 | 3.981 | | 0.985808 | 0.984156 |
| linear_ramp2.BMP | 1.3377 | 2.3049 | 1.3021 | 1.5617 | 1.9306 | 2.5396 | 4.0756 | | 0.628664 | 0.629358 |
| orange_purple.BMP | 2.9032 | 3.0685 | 2.9026 | 2.9653 | 3.2684 | 4.4123 | 7.937 | | 1.471407 | 2.585087 |
| pink_green.BMP | 3.2058 | 3.3679 | 3.2 | 3.2569 | 3.7949 | 4.5127 | 7.3481 | | 1.247967 | 1.726312 |
11/18/2008
11-18-08 - DXTC Part 2
Here are some numbers on various DXT1 compressors. The numbers in the table are RMSE (the square root of the L2 error). The far right column is a wavelet compressor for comparison; it's not the best wavelet compressor in the world by a long shot, it's "cbwave", which is very old and which I designed for speed, not maximum quality. In any case it gives you an idea how far off DXT1 is. (BTW I always try to show results as RMSE because it is linear in pixel magnitude, unlike MSE or PSNR, which are very weird nonlinear scales). More discussion after the table...
(ADD: the "ryg" results shown here have a bug and should be ignored; see DXTC Post Summary )
| file | Squish opt | Squish | CB opt | CB | ryg | D3DX8 | FastDXT | cbwave | |
| kodim01.bmp | 8.2808 | 8.3553 | 8.352 | 8.6924 | 9.374 | 9.8466 | 9.9565 | 2.6068 | |
| kodim02.bmp | 6.1086 | 6.2876 | 6.1287 | 6.3025 | 7.52 | 7.4308 | 8.456 | 1.6973 | |
| kodim03.bmp | 4.7804 | 4.9181 | 4.8312 | 5.0225 | 5.855 | 6.094 | 6.4839 | 1.3405 | |
| kodim04.bmp | 5.6913 | 5.8116 | 5.7285 | 5.9394 | 6.9408 | 7.1032 | 7.3189 | 1.8076 | |
| kodim05.bmp | 9.6472 | 9.7223 | 9.707 | 10.112 | 10.8934 | 11.273 | 12.0156 | 2.9739 | |
| kodim06.bmp | 7.1472 | 7.2171 | 7.1777 | 7.44 | 8.1005 | 8.5195 | 8.6202 | 2.0132 | |
| kodim07.bmp | 5.7804 | 5.8834 | 5.8379 | 6.0583 | 6.8153 | 7.2182 | 7.372 | 1.4645 | |
| kodim08.bmp | 10.2391 | 10.3212 | 10.346 | 10.747 | 11.3992 | 11.8703 | 12.2668 | 3.2936 | |
| kodim09.bmp | 5.2871 | 5.3659 | 5.3306 | 5.5234 | 5.9884 | 6.5332 | 6.6716 | 1.6269 | |
| kodim10.bmp | 5.2415 | 5.3366 | 5.2777 | 5.4633 | 5.9377 | 6.4601 | 6.4592 | 1.7459 | |
| kodim11.bmp | 6.7261 | 6.8206 | 6.7643 | 7.0216 | 7.8221 | 8.1056 | 8.2492 | 1.8411 | |
| kodim12.bmp | 4.7911 | 4.8718 | 4.8204 | 4.9863 | 5.6651 | 6.005 | 6.0748 | 1.5161 | |
| kodim13.bmp | 10.8676 | 10.9428 | 10.925 | 11.4237 | 12.402 | 12.7139 | 12.9978 | 4.1355 | |
| kodim14.bmp | 8.3034 | 8.3883 | 8.3398 | 8.6722 | 9.4258 | 9.896 | 10.8481 | 2.4191 | |
| kodim15.bmp | 5.8233 | 5.9525 | 5.8568 | 6.0862 | 6.6749 | 7.3085 | 7.4932 | 1.6236 | |
| kodim16.bmp | 5.0593 | 5.1629 | 5.0863 | 5.2851 | 5.8093 | 6.3361 | 6.1592 | 1.546 | |
| kodim17.bmp | 5.5019 | 5.6127 | 5.5313 | 5.7358 | 6.4975 | 6.7395 | 6.8989 | 1.7166 | |
| kodim18.bmp | 7.9879 | 8.0897 | 8.0192 | 8.3716 | 9.7744 | 9.5357 | 9.7857 | 2.9802 | |
| kodim19.bmp | 6.5715 | 6.652 | 6.6692 | 6.91 | 8.0128 | 7.9229 | 8.0096 | 2.0518 | |
| kodim20.bmp | 5.4533 | 5.5303 | 5.4895 | 5.6864 | 6.3457 | 6.4878 | 6.8629 | 1.5359 | |
| kodim21.bmp | 7.1318 | 7.2045 | 7.1724 | 7.4582 | 8.1637 | 8.4703 | 8.6508 | 2.0659 | |
| kodim22.bmp | 6.43 | 6.5127 | 6.4644 | 6.7137 | 7.8264 | 8.0046 | 7.9488 | 2.2574 | |
| kodim23.bmp | 4.8995 | 5.0098 | 4.9244 | 5.0906 | 5.6989 | 6.3057 | 6.888 | 1.3954 | |
| kodim24.bmp | 8.4271 | 8.5274 | 8.4699 | 8.8564 | 9.3906 | 9.9389 | 10.5156 | 2.4977 | |
| clegg.bmp | 14.9733 | 15.2566 | 15.1755 | 15.7168 | 16.3563 | 21.471 | 32.7192 | 10.5426 | |
| FRYMIRE.bmp | 10.7184 | 12.541 | 12.132 | 12.8278 | 12.989 | 16.7308 | 28.9283 | 6.2394 | |
| LENA.bmp | 7.138 | 7.2346 | 7.1763 | 7.4264 | 8.1203 | 8.742 | 9.5143 | 4.288 | |
| MONARCH.bmp | 6.5526 | 6.6292 | 6.5949 | 6.846 | 7.5162 | 8.1053 | 8.6993 | 1.6911 | |
| PEPPERS.bmp | 6.3966 | 6.5208 | 6.4557 | 6.677 | 7.3618 | 8.1855 | 8.8893 | 2.3022 | |
| SAIL.bmp | 8.3233 | 8.3903 | 8.3598 | 8.6627 | 9.8685 | 9.7838 | 10.5673 | 2.9003 | |
| SERRANO.bmp | 6.3508 | 6.757 | 6.8385 | 7.9064 | 7.5303 | 9.0549 | 18.3631 | 4.6489 | |
| TULIPS.bmp | 7.5768 | 7.656 | 7.6146 | 7.8786 | 8.4084 | 9.3817 | 10.5873 | 2.2228 |
Back to comparing the DXT1 encoders. BTW the test set here is the Kodak image set plus the Waterloo Bragzone image set. The Kodak set is all photographs that are pretty noisy, and there's not a huge difference in the coders. The Bragzone image set has some synthetic images with things like gradients which are harder to compress well, and there you can really dramatically see the bad encoders fall apart. In particular if you look at the results on "clegg" and "frymire" and "serrano" you can see how bad the "FastDXT" coder is.
The "Squish" in the table is the iterative cluster fit with uniform weighting. All coders work on linear RGB error, and the MSE is the mean per pixel, not per component.
The "CB" encoder is a simple 2-means fit. I seed the means by doing PCA to find the best-fit line. I put a plane perpendicular to that line through the average and take the points in each half to be the two clusters, average each cluster to seed the means, and then iteratively refine the means by reclustering. Once I have the 2-means I do a simple search to find the best 565 DXT end points to fit the two means. There are 3 cases to try:
1. put c0 and c1 = the two means
2. put c2 and c3 = the two means (so c0 and c1 are pushed out)
3. make (c0,c2) straddle mean 1 and (c3,c1) straddle mean 2 - this is best for gaussian clusters around the mean
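The seeding-and-reclustering procedure above can be sketched roughly as follows. This is not the actual CB encoder, just a minimal illustration: `V3` and `twoMeans` are hypothetical names, and power iteration on the covariance stands in for a full PCA eigensolve.

```cpp
#include <array>
#include <cmath>
#include <vector>

struct V3 { double x, y, z; };
static V3 add(V3 a, V3 b){ return {a.x+b.x, a.y+b.y, a.z+b.z}; }
static V3 sub(V3 a, V3 b){ return {a.x-b.x, a.y-b.y, a.z-b.z}; }
static V3 scale(V3 a, double s){ return {a.x*s, a.y*s, a.z*s}; }
static double dot(V3 a, V3 b){ return a.x*b.x + a.y*b.y + a.z*b.z; }

// 2-means over the block's colors: approximate the PCA axis, split by the
// plane through the mean, average each half to seed, then refine by
// Lloyd-style reclustering.
std::array<V3,2> twoMeans(const std::vector<V3>& pts)
{
    V3 mean{0,0,0};
    for (V3 p : pts) mean = add(mean, p);
    mean = scale(mean, 1.0 / pts.size());

    // power iteration approximates the principal axis of the covariance
    V3 axis{1,1,1};
    for (int iter = 0; iter < 8; ++iter) {
        V3 next{0,0,0};
        for (V3 p : pts) {
            V3 d = sub(p, mean);
            next = add(next, scale(d, dot(d, axis)));
        }
        double len = std::sqrt(dot(next, next));
        if (len < 1e-12) break;          // degenerate: all pixels equal
        axis = scale(next, 1.0/len);
    }

    std::array<V3,2> m = { mean, mean };
    for (int iter = 0; iter < 4; ++iter) {
        std::array<V3,2> sum = { V3{0,0,0}, V3{0,0,0} };
        std::array<int,2> cnt = { 0, 0 };
        for (V3 p : pts) {
            int side;
            if (iter == 0) {             // first pass: split on the PCA plane
                side = dot(sub(p, mean), axis) >= 0.0;
            } else {                     // later passes: nearest current mean
                V3 d0 = sub(p, m[0]), d1 = sub(p, m[1]);
                side = dot(d1, d1) < dot(d0, d0);
            }
            sum[side] = add(sum[side], p);
            cnt[side]++;
        }
        for (int c = 0; c < 2; ++c)
            if (cnt[c]) m[c] = scale(sum[c], 1.0/cnt[c]);
    }
    return m;
}
```

The 3-case 565 endpoint search would then run on the two means this returns.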
A 2-means like this is slightly better than doing a pca "range fit" like Simon's fast method or the "ryg" method. If the data was all Gaussian noise, they would be equivalent, but of course it's not. You often get blocks that have a bunch of pixels at the low end which are all exactly the same color ( for example, all perfect black), and then a spread of a bunch of junk at the high end (like some orange, some yellow, etc.). You want to put one end point exactly on perfectly black and put the other endpoint at the center of the cluster on the high end.
"CB opt" and "Squish opt" take the results of the CB and Squish compressors and then improve them by iterative search. Simon Brown on his page mentions something about doing greedy endpoint refinement but claims it "doesn't converge because the indices change". That's silly, of course it converges.
To do a greedy search: try wiggling one or both end points in 565 space. Find the new best index fit for the new end points. Measure the new error. If the error is lower, take the step.
Of course that works and it does converge and it's pretty simple and improves the encoding after Squish.
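The greedy step can be sketched in a toy setting. To keep it short this uses grayscale pixels with plain integer endpoints standing in for 565, and the standard DXT1-style 4-entry palette; `paletteError` and `greedyRefine` are hypothetical names, not anyone's actual code. It converges because the error strictly decreases at each accepted step and is bounded below.

```cpp
#include <limits>
#include <vector>

// Error of the best index fit: each pixel snaps to the nearest of the
// 4 palette entries {e0, e1, 2/3*e0+1/3*e1, 1/3*e0+2/3*e1}.
double paletteError(int e0, int e1, const std::vector<int>& pix)
{
    double pal[4] = { (double)e0, (double)e1,
                      (2.0*e0 + e1)/3.0, (e0 + 2.0*e1)/3.0 };
    double err = 0;
    for (int p : pix) {              // re-fit: best index per pixel
        double best = std::numeric_limits<double>::max();
        for (double v : pal) { double d = p - v; if (d*d < best) best = d*d; }
        err += best;
    }
    return err;
}

// Wiggle one or both endpoints by +-1; keep any step that lowers the error.
void greedyRefine(int& e0, int& e1, const std::vector<int>& pix)
{
    double err = paletteError(e0, e1, pix);
    bool improved = true;
    while (improved) {
        improved = false;
        for (int d0 = -1; d0 <= 1; ++d0)
        for (int d1 = -1; d1 <= 1; ++d1) {
            double e = paletteError(e0 + d0, e1 + d1, pix);
            if (e < err) { err = e; e0 += d0; e1 += d1; improved = true; }
        }
    }
}
```

The real thing wiggles in quantized 565 space per channel, but the structure is the same: perturb, re-fit indices, keep the step only if the measured error drops.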
In fact you can do even better by doing simulated annealing instead of a plain greedy search. We should know by now that any time we have a greedy search like this where we can measure the error and it can get stuck in local minima, we can improve it with something like simulated annealing. I use 256 steps of annealing with a sinusoidal decreasing temperature pattern. I randomly pick a way to wiggle the two end points (you need to consider wiggling both, not just single wiggles). I try the wiggle and measure the error delta. Negative errors (improvements) are always taken, positive errors are taken probabilistically based on the temperature. This is what was done in the "opt" coders above.
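A sketch of that annealing loop, on the same kind of 1D grayscale stand-in for 565 endpoints. The 256-step count and the decaying temperature follow the description above; everything else (the toy error metric, the cosine schedule, the RNG choices, the names `toyError`/`annealEndpoints`) is illustrative assumption, not the actual coder.

```cpp
#include <cmath>
#include <limits>
#include <random>
#include <vector>

// Toy 4-entry DXT1-style palette error for grayscale pixels.
static double toyError(int e0, int e1, const std::vector<int>& pix)
{
    double pal[4] = { (double)e0, (double)e1,
                      (2.0*e0 + e1)/3.0, (e0 + 2.0*e1)/3.0 };
    double err = 0;
    for (int p : pix) {
        double best = std::numeric_limits<double>::max();
        for (double v : pal) { double d = p - v; if (d*d < best) best = d*d; }
        err += best;
    }
    return err;
}

// 256 annealing steps: random wiggle of one or both endpoints, downhill
// moves always taken, uphill moves taken with probability exp(-delta/T).
double annealEndpoints(int& e0, int& e1, const std::vector<int>& pix,
                       unsigned seed = 1234)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> wiggle(-1, 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    double err = toyError(e0, e1, pix);
    int be0 = e0, be1 = e1;
    double bestErr = err;

    for (int step = 0; step < 256; ++step) {
        // temperature ramps down from 10 to ~0 over the run
        double t = 10.0 * 0.5 * (1.0 + std::cos(3.14159265358979 * step / 256.0));
        int d0 = wiggle(rng), d1 = wiggle(rng);     // wiggle one or both
        double newErr = toyError(e0 + d0, e1 + d1, pix);
        if (newErr < err || unit(rng) < std::exp((err - newErr) / t)) {
            e0 += d0; e1 += d1; err = newErr;
            if (err < bestErr) { bestErr = err; be0 = e0; be1 = e1; }
        }
    }
    e0 = be0; e1 = be1;        // keep the best state seen, not the last one
    return bestErr;
}
```

Note the search remembers the best state ever visited; since annealing deliberately accepts some uphill moves, the final state is not necessarily the best one.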
Most of the time the annealing search just makes a small improvement, but once in a while (such as on "frymire" and "serrano") it really finds something remarkable and makes a big difference.
Simon's cluster fit alg is very sweet. I didn't really understand it at first from the description and the code, so I'm going to redescribe it for myself here.
The basic idea is you find an assignment of indices, and from that solve a best fit to give you the end points for those indices. If you think about all the colors in the block living in 3D color space, assigning indices is like giving labels to each point. All the points with the same label are a "cluster". Then each cluster is fit with either an end point or one of the interpolated points.
Once you know the clusters, putting the best two end points is a simple linear optimization. You can solve the least squares analytically; you don't need an iterative matrix solver or anything.
So the trick is how to decide the indices. If you tried them all, it would be 4^16 ways, which is way too many. So what Simon does is create a linear ordering of the points using the PCA axis, then try all clusterings that can be created by planes perpendicular to that linear axis. That's equivalent to just trying all groupings of the indices sorted by their dot along the PCA axis. That is, the clusters are
[0,i) , [i,j) , [j,k) , [k,16) for all i <= j <= k, which is a manageable amount to search, and gives you most of the interesting clusterings that you care about. Something that might improve Squish is trying a few different axes and picking the best.
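It's easy to see how small that search is. Counting the clusterings [0,i) [i,j) [j,k) [k,16) over the PCA-sorted order (a quick sketch, not Squish's code):

```cpp
// Count candidate clusterings of n sorted pixels into 4 contiguous runs
// delimited by 0 <= i <= j <= k <= n. For n = 16 this is tiny compared
// to the 4^16 (~4.3 billion) unrestricted index assignments.
int countClusterings(int n)
{
    int count = 0;
    for (int i = 0; i <= n; ++i)
        for (int j = i; j <= n; ++j)
            for (int k = j; k <= n; ++k)
                ++count;   // one candidate clustering per (i,j,k)
    return count;
}
```

For each of those candidates you run the analytic endpoint fit and keep the best; a few hundred least-squares solves per block is entirely practical.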
BTW this end point optimization is very approximate. One issue is that it assumes the best index for each point doesn't change. It also of course just uses real number arithmetic to make the 1/3 points, not the actual integer rounding that's done. Those factors are actually pretty big.